Programming for FinTech

Module 3: R

Prof. Matthew G. Son

University of South Florida

R Programming Basics

Basic Calculations

You can use R to do basic math calculations.

1 / 200 * 30
[1] 0.15
(59 + 73 + 2) / 3
[1] 44.66667

Question: What is 500 * (100 / 2.5) + 770?

Arithmetic Operators

  • Addition (+), Subtraction (-)
  • Multiplication (*), Division (/)
  • Exponentiation (^ or **)
  • Modulo (%%): Returns the remainder of a division
5 %% 3
[1] 2
  • Integer Division (%/%): Returns the integer quotient
8 %/% 3
[1] 2

Logical Test Operators

  • Less than (<), greater than (>)
  • Less than or equal (<=), greater than or equal (>=)
  • Equality (==) and Inequality (!=)
  • Logical NOT (!)
  • Element-wise AND (&), OR (|)

Exercise: Check

  • 3 >= 5,
  • TRUE == FALSE,
  • TRUE & FALSE,
  • (3 > 1) & (3 <= 5)
  • TRUE | FALSE,
  • !TRUE == FALSE
  • !(3 > 1) & (3 <= 5)

Comments

Comments explain the code and make it more readable; they are ignored by the computer.

  • Use # to make a comment
  • Anything written after # on the line is ignored
# this is comment
1 + 2 # This is another comment
[1] 3

Tip

Use Ctrl (Cmd) + / hotkey to toggle comments.

Scalar

A scalar is the simplest kind of data in R: it represents a single element, not a collection.

10 # Number
"New to R" # Character
TRUE # logical

Atomic Vector

To combine multiple elements into a vector, use the c() function.

c(1, 2) # 2 elements
[1] 1 2
c(2, 3, 4, 6, 7) #5 elements
[1] 2 3 4 6 7
c(3) # 1 length vector, same as scalar 
[1] 3

Tip

In R, “scalar” is actually represented as a vector of length 1.

Integer sequence

Colon : generates integer sequence from:to.

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
-5:2
[1] -5 -4 -3 -2 -1  0  1  2
3:-5
[1]  3  2  1  0 -1 -2 -3 -4 -5

Length of a vector

To check the number of elements of a vector, use length() function:

length(3) # Scalar is length 1 vector
[1] 1
length(c(5))
[1] 1
length(c(1, 2, 3))
[1] 3

Vectorized operations

Vectorized operations are vector-in, vector-out, as opposed to scalar operations (one-in, one-out).

In R, most operations are designed with vectors in mind.

For example, + and * are vectorized operators.

c(1, 3, 5) + c(2, 3, 4) # plus on each element of vector
c(5, 6, 7, 8) * 3 # multiply each (recycle x3 to all elements)

Question:

What do you get when c(2,3) * c(3,9)?

What’s the length of the output?

Recycling rule

When the operands are of different lengths, the shorter one is recycled as many times as necessary.

# Try !
1:10 / 3
1:10 * c(-1,1)
5 ^ c(1,3,5)

When it cannot be recycled entirely, it still works but raises a warning message:

# Try !
c(1, 10, 100) * 1:5

Why vectorized?

Note

Vectorized operations are typically much faster than iterating (looping) over each single (scalar) element.

Avoid using loops and utilize vectorized operation whenever possible.

For computers, using vectorization or not is like a difference in our mental calculation between

  • 3 * 9 and
  • (3 + 3 + 3 + 3 + 3 + 3 + 3 + 3 + 3).
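A rough way to see the difference on your own machine is to time both approaches. The sketch below uses base R's system.time(); the vector size and the names x, loop_result, and vec_result are arbitrary choices for illustration, and the exact timings will vary by computer.

```r
x <- 1:1e6  # one million integers (size chosen arbitrarily)

# Loop version: multiply one element at a time
loop_time <- system.time({
  loop_result <- numeric(length(x))
  for (i in seq_along(x)) {
    loop_result[i] <- x[i] * 3
  }
})

# Vectorized version: multiply the whole vector at once
vec_time <- system.time({
  vec_result <- x * 3
})

identical(loop_result, vec_result) # both approaches give the same values
loop_time[["elapsed"]]             # elapsed seconds for the loop
vec_time[["elapsed"]]              # elapsed seconds for the vectorized call
```

On most machines the vectorized call should finish in a small fraction of the loop's time, although the exact ratio depends on the R version and hardware.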

Exercises

  1. What is wrong with c(1 + 1, 2 + 1, 3 + 1)? How can you make it better?

  2. Answer how c(1,2) * 1:3 works.

REPL vs Scripting

REPL: Interactive programming

So far, we’ve done interactive programming, REPL:

  • Read, Evaluate, Print, Loop
  • For rapid prototyping, exploration, debugging, etc.

Continuation Prompt

> cat("Here's command I'm typing
+ but it was unfinished
+ so R waits until I finish expression...
+ 
+ hitting Enter doesn't abort at all..
+
+ press ESC or CTRL+C to abort 
")

On console: > means: “Waiting for your command”

+ means: “Continue command”

CTRL + C to abort.

Script (Batch)

Alternatively, you can run a whole script from outside the R environment, using Rscript:

# from outside of R environment (e.g., bash)
Rscript my_code.R

To run a script file in the current R session:

# from inside of R environment
source("my_code.R")

Writing a complete script is your final goal in programming.

  • For production and deployment

Tip

REPL for development, and script for production.

To run the (whole) script, Ctrl (Cmd) + Shift + Return

Binding Names (Symbols)

Use <- to bind a name (symbol) to an object.

# object of 3 is created, bind its name to my_number
my_number <- 3 
print(my_number * 3)
[1] 9

Here, my_number is called a symbol, or the name of an object.

Tip

Style guide: though you still can use =, use <- for assignment.

Use = for specifying function arguments instead.

Some IDEs (e.g., RStudio, Positron, VS Code) have Alt (Option) + - as a shortcut.

R has strict rules about a syntactic name (symbol).

  • It is case sensitive
a <- 3
print(A)
Error: object 'A' not found
  • It cannot contain whitespace
my number <- 1 # error
Error in parse(text = input): <text>:1:4: unexpected symbol
1: my number
       ^
  • It cannot start with numbers
my_number_1 <- 15
1_my_number <- 10 # error
Error in parse(text = input): <text>:2:2: unexpected input
1: my_number_1 <- 15
2: 1_
    ^
  • You can’t use reserved words like TRUE, NULL, if, etc.

  • If you’d deliberately use non-syntactic names, use backtick `

# Use backtick escaping only when you have to
`my number` <- 1
print(`my number`)
[1] 1

Object naming conventions

Since names cannot contain whitespace, there are some naming conventions.

  • snake_case

  • camelCase

  • PascalCase

when_you_have_very_long_name_object <- 12
YouCanNameItLikeThis <- 1

Tip

It is better to make a short, self-explanatory name.

e.g. weight <- 15 is easier to understand than my_variable_quantity <- 15

Interactive prompt

readline() generates a prompt for interactive input.

readline("What is your age?")

The response is returned as a character string and can be assigned to an object.

response_age <- readline("What is your age?")
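Because readline() always returns a character string, even when the user types digits, the answer usually needs converting before doing math with it. A minimal sketch using as.numeric(); the string "27" is a stand-in for what a user might type, so the snippet also runs non-interactively:

```r
# In an interactive session you would write:
# response_age <- readline("What is your age? ")
response_age <- "27"  # stand-in for the user's typed answer

typeof(response_age)             # "character"
age <- as.numeric(response_age)  # convert text to a number
typeof(age)                      # "double"
age + 1                          # 28
```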

In class Exercise

  1. Create a vector object of numbers: 98,99,100,101,102
  2. Assign above to a symbol bond_prices.
  3. If you followed the camelCase naming convention, what would the name be?
  4. Generate an interactive prompt that asks for the interest rate, and assign the response to the symbol interest_rate

Evaluating vs Assigning

Consider the following code. What is the printed value of a?

a <- 3.5
a * 2
print(a)

The expression a * 2 is evaluated but not assigned to any variable, so a remains unchanged.

To store the result, you need to assign it:

a <- a * 2 # overwrite or
b <- a * 2 # assign to other symbol

R Data Types

Object types in R

  • Vector type: Common data types

  • Special type (non-vector): non-vectors

    • Functions, Environments, etc.

Vectors are the most important family of data types in R.

Vector type

Vector is a data structure that stores multiple elements. It comes in two flavors:

  • Atomic vector: all elements same type
  • Generic vector: known as list, can have different types of elements

NULL is not a vector, but often serves as zero length vector.

Atomic vectors

There are four primary types of atomic vector in R, and two others.

Type of Atomic vectors

  1. Logical (or Boolean): TRUE, FALSE, NA
logical_test <- 3 > 5
print(logical_test)
[1] FALSE
typeof(logical_test)
[1] "logical"
  2. Integer: integer numbers

Append L to treat the number as a strict integer.

my_integer <- 3L # L specifies the number is integer
typeof(my_integer)
[1] "integer"
  3. Double: real numbers
my_number <- 3.125 # A length of 1 vector
typeof(my_number)
[1] "double"

Caution

numeric is a collective term for both double and integers but often used as if it were a synonym for “double” or “real number” in practice.

  4. Character (or string): words, wrapped by " or '
my_name <- "Matthew Son" # wrap double quotes around
korean_name <- 'Gunsu Son' # or single quotes
typeof(my_name)
[1] "character"
typeof(korean_name)
[1] "character"

Tip

Style guide: Use double quote " for character instead of ' if possible.

  5. Two other types:
  • raw type: binary data type

  • complex type: complex numbers (e.g. 3 + 4i)

  • rarely needed in Finance
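Returning to the caution about numeric: base R's is.numeric() treats both integer and double as numeric, while class() reports them differently. A quick sketch with arbitrary values:

```r
is.numeric(3L)     # TRUE: integers count as numeric
is.numeric(3.5)    # TRUE: doubles count as numeric
is.numeric("3.5")  # FALSE: a character string is not numeric

class(3L)          # "integer"
class(3.5)         # "numeric"
```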

Missing values: NA

  • Missing values are denoted by NA
    • NA stands for “Not Available”: the value is unknown or undefined
  • They are not identical to zero or NULL
    • NULL is intentional empty “placeholder” in R

NA is considered as logical length 1 vector.

typeof(NA) # "logical"
length(NA) # 1

NULL is a special type (NULL), length 0

typeof(NULL) # "NULL"
length(NULL) # 0

NaN (Not a Number) is a special numeric value of length 1.

  • It marks undefined numeric results (e.g., 0/0, log(-1)).
typeof(NaN) # "double"
length(NaN) # 1
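A short sketch of how NA, NULL, and NaN behave in practice, using only base R functions. The vector values are made up for illustration.

```r
x <- c(1, NA, 3)       # hypothetical data with one missing value

length(x)              # 3: NA still occupies a slot
length(c(1, NULL, 3))  # 2: NULL contributes no element
is.na(x)               # FALSE TRUE FALSE
sum(x)                 # NA: missing values propagate through calculations
sum(x, na.rm = TRUE)   # 4: drop NA before summing

is.na(NaN)             # TRUE: NaN also counts as missing
is.nan(NA)             # FALSE: but NA is not NaN
```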

Exercise

  1. What are four primary types of atomic vector?

  2. What are the types of a,b,c,d below?

a <- TRUE
b <- 3.56
c <- "Logical"
d <- 6

Confirm your answer with typeof().

List

List is a generic vector that is not atomic.

  • Atomic vector can have only one type for its elements (Double, Integer, Logical, …)

  • List can hold multiple data types for its member (even list itself)

example_list <- list(1L, 3.5, 'Hi', TRUE)
print(example_list)
[[1]]
[1] 1

[[2]]
[1] 3.5

[[3]]
[1] "Hi"

[[4]]
[1] TRUE
typeof(example_list)
[1] "list"

List can have atomic vectors as its elements, with varying lengths:

# List construction example
# Use "=" inside function
my_list <- list(
  first_element = c(1L, 2L, 3L), # integer type, length of 3
  second_element = c('how', 'what', 'why', 'where'), # character type, length 4
  third_element = c(3 < 5, 3 == 5, 3 > 5) #logical type, length 3
)
typeof(my_list)
[1] "list"

Question: What is the length of my_list?

R Object Attributes

Attributes

Attributes are metadata that is attached to R objects, providing additional information or functionality.

Common attributes:

  • Names: Labels for elements in a vector or list

  • Dimensions (dim): Used for matrices, arrays

  • Class: Defines how an object should be treated by functions

  • etc.

Names attribute

Elements of vector (atomic, generic) can be named.

stock_prices <- c(150, 200, 250, 300)
attributes(stock_prices) # No attributes
NULL

There are roughly three ways to assign names attribute.

Method 1: names()

In R, commonly used attributes have their own accessor functions named after them, such as names(), class(), and dim().

names(stock_prices) <- c("AAPL", "GOOG", "MSFT", "AMZN")
attributes(stock_prices) # stock_prices now have "names" attribute
$names
[1] "AAPL" "GOOG" "MSFT" "AMZN"
print(stock_prices)
AAPL GOOG MSFT AMZN 
 150  200  250  300 

Method 2: attr()

Or use attr() function to set attribute:

# method 2
stock_prices <- c(150, 200, 250, 300)
attr(stock_prices, "names") <- c("AAPL", "GOOG", "MSFT", "AMZN")
attributes(stock_prices)
$names
[1] "AAPL" "GOOG" "MSFT" "AMZN"

Method 3: by construct

Or assign names by construct:

# method 3
stock_prices2 <- c(
  "AAPL" = 150,
  "GOOG" = 200,
  "MSFT" = 250,
  "AMZN" = 300
)
attributes(stock_prices2)
$names
[1] "AAPL" "GOOG" "MSFT" "AMZN"

Exercise

stock_prices <- c(100, 150)

Give names attribute to stock_prices vector using aforementioned three methods.

  • names(obj) <-
  • attr(obj, "names") <-
  • by construct

Dim attribute

Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.

# matrix
x <- 1:6
dim(x) <- c(2, 3) # assign 2 by 3 dimension attribute
print(x)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# array
y <- 1:8
dim(y) <- c(2, 2, 2) # 2 by 2 by 2
print(y)
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

Class attribute

The class attribute in R defines how an object behaves when it is passed to functions.

Especially important classes in Finance are:

  • Date, Time
  • Factors
  • Dataframe
  • or your own custom-built class

Example 1: Date/Time

Very important class in Finance.

They are built on a double-type atomic vector (type), but have their own rules for use (class).

today <- Sys.Date() # returns today's date
now <- Sys.time() # returns time now
print(today)
[1] "2026-02-08"
print(now)
[1] "2026-02-08 15:43:35 EST"

Check their data type:

typeof(today)
[1] "double"
typeof(now)
[1] "double"

Check their attributes: they have class attributes.

attributes(today)
$class
[1] "Date"
attributes(now)
$class
[1] "POSIXct" "POSIXt" 

To directly access the class attribute:

class(today)
[1] "Date"
class(now)
[1] "POSIXct" "POSIXt" 

Class attribute and change of behavior

For an example, see how it works with + function.

print(today + 1) # adds one day
[1] "2026-02-09"
print(now + 1) # adds one second
[1] "2026-02-08 15:43:36 EST"

Q: Why does +1 yield different results?

A: Because they are of different classes, +1 is interpreted differently.

The class attribute gives context for how an object should behave with functions.

Deep down, they are just numbers:

# strip off class attribute (only)
unclass(today)
[1] 20492
unclass(now)
[1] 1770583416
  • Date: The value of double represents the number of days since “1970-01-01” (Unix Epoch)
  • Time: the number of seconds since Unix Epoch
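You can verify this with fixed dates instead of today's date. The sketch below uses as.Date() to build Date objects from text; the specific dates are chosen only for illustration.

```r
epoch <- as.Date("1970-01-01")
unclass(epoch)         # 0: the epoch itself is day zero

d <- as.Date("2026-02-08")
unclass(d)             # 20492 days since the epoch
as.numeric(d - epoch)  # 20492: subtracting dates counts days
```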

Time zone attribute

Time has “tzone” attribute (time-zone) that controls “formatting” of date-time.

  • The lower-level data (double) for the time is not changing.
now <- Sys.time()
print(now) # EST time zone
[1] "2026-02-08 15:43:36 EST"
unclass(now) # a double number
[1] 1770583416
# set attribute tzone to UTC
attr(now, "tzone") <- "UTC"
print(now) # formatted for UTC zone
[1] "2026-02-08 20:43:36 UTC"
unclass(now) # same underlying number
[1] 1770583416
attr(,"tzone")
[1] "UTC"

Example 2: Factors

Factors (or Categorical) can only have a set of predefined values.

  • It is built on top of integer type
# Factors can be ordered 
firm_size <- factor(
  c("Large", "Mid", "Small"),
  levels = c("Small", "Mid", "Large"), # predetermined values
  ordered = TRUE # give increasing order to above level
)
typeof(firm_size)
[1] "integer"
class(firm_size)
[1] "ordered" "factor" 
# Factors are unordered by default
political_parties <- factor(
  c("Democratic", "Republican")
) # unordered
typeof(political_parties)
[1] "integer"
class(political_parties)
[1] "factor"

If they were stripped off all attributes:

attributes(firm_size) <- NULL # remove all attributes
firm_size # integer (it gave greater number for higher level)
[1] 3 2 1
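Two consequences of this design, sketched with the same firm_size example (recreated here so the snippet runs on its own): ordered factors can be compared, and values outside the predefined levels become NA.

```r
firm_size <- factor(
  c("Large", "Mid", "Small"),
  levels = c("Small", "Mid", "Large"),
  ordered = TRUE
)

levels(firm_size)      # "Small" "Mid" "Large"
as.integer(firm_size)  # 3 2 1: position of each value in the levels
firm_size > "Small"    # TRUE TRUE FALSE: comparison follows the level order

# "Micro" is not among the predefined levels, so it becomes NA
factor(c("Large", "Micro"), levels = c("Small", "Mid", "Large"))
```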

Base types and Class

Example 3: Dataframe & tibble

A class built on top of list type, with 2D tabular representations

  • Similar to matrix (2D form)
  • Dataframe and tibble can have different types of columns
# iris is available dataframe
head(iris) # prints first 6 rows
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
typeof(iris) # list type
[1] "list"
class(iris)
[1] "data.frame"

If class attribute was removed: it turns back to list

# Try !
iris_list <- unclass(iris)
print(iris_list)

Let’s browse the attributes of iris dataframe:

# names (column names), class, row.names
attributes(head(iris))
$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

$row.names
[1] 1 2 3 4 5 6

$class
[1] "data.frame"

Tibble is a robust dataframe class.

  • It has better printing output than data.frame
  • Convert class with as_tibble()
library(tibble) # as_tibble() comes from the tibble package (also loaded by tidyverse)
iris_tb <- as_tibble(iris)
iris_tb |> head(3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
iris_tb |> typeof()
[1] "list"

Implicit Class

Those “building-block” types in R have implicit class:

  • Atomic vectors: (primary data types - Logical, Integer, Double, Character)
  • Generic vector (list)
  • Matrix / Array
  • etc.

Their class is not shown in the attributes(), but still shown when explicitly asked with class()

a <- TRUE
typeof(a) 
class(a) 
attributes(a) # no class attribute printed

b <- 3L
typeof(b)
class(b)
attributes(b)

c <- 3.5
typeof(c)
class(c) # the double type has the implicit class "numeric"
attributes(c)

d <- "A"
typeof(d)
class(d)
attributes(d)

Exercise

  1. Execute typeof(c(1,2,3)) and typeof(c(1L, 2L, 3L)). What’s the difference?
  2. Assign a vector with three elements: 1,3,5, and name it as my_first_object
  3. Assign another object with one element: 5, name it as MySecondObject
  4. Multiply my_first_object with MySecondObject. What do you get?
  5. Assign a vector with your name: my_name
  6. What do you get when you execute my_name + 3? (Expect error)

Importance of class

Since class determines the behavior of an object, it is crucial to know the class of your data, especially when performing function calls.

Note

Calling a function means executing/applying a function.

For example, you cannot apply the addition function to a number and a character.

# Another way to perform plus
`+`(3, 4) # good
[1] 7
`+`(3, "Hi") # error
Error in 3 + "Hi": non-numeric argument to binary operator

Other example: c()

Some functions coerce the class / type instead.

my_vector <- c(1, 3, 'hello', 5) # Does it work?
typeof(my_vector)
[1] "character"
class(my_vector) # it works by coercion
[1] "character"

Tip

c() coerces its inputs when they are of different types, but not all functions do coercion.

Exercise: coercion

Let’s see how coercion works with c().

test <- c(3.14, Sys.Date(), Sys.time())

What is the type of test?

What is the class of test?

Coercion precedence

In general, coercion follows a fixed order, always toward the “heavier” type:

Character (heavy) << Double << Integer << Logical (light)

typeof(c(TRUE))
[1] "logical"
typeof(c(TRUE, 2L))
[1] "integer"
typeof(c(TRUE, 2L, 3.14))
[1] "double"
typeof(c(TRUE, 2L, 3.14, "A"))
[1] "character"
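Coercion can also be requested explicitly with the as.*() family of base R functions, and it happens implicitly when logical values are used in arithmetic. A brief sketch (the vector above_par is made up for illustration):

```r
as.integer(TRUE)    # 1
as.numeric("2.15")  # 2.15
as.character(3L)    # "3"

# Logical values become 0/1 in arithmetic, which makes counting easy
above_par <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical test results
sum(above_par)      # 3: number of TRUE values
mean(above_par)     # 0.75: share of TRUE values
```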

Quick Check

What would be the type of:

typeof(c(TRUE, 3.55))
typeof(c(3L, 5.3))
typeof(c(3L, "2.15"))
typeof(c(TRUE, 3L, 3.34, "2.15"))
typeof(c(TRUE, "FALSE"))

Data type and memory footprint

Given the same length, logical and integer vectors take the least memory, followed by double, then character.

lobstr::obj_size(c(TRUE, FALSE, TRUE)) # uses integer for memory footprint internally
64 B
lobstr::obj_size(c(3L, 15L, -5L))
64 B
lobstr::obj_size(c(3.12, 53.1, 6.22))
80 B
lobstr::obj_size(c("A", "B", "C"))
248 B

R Access Operations

Vector Indexing

  • Single square brackets [ and ]

    • Subset multiple elements from vector
  • Double brackets [[ and ]]

    • Subset single element (scalar) from vector

Vector indexing: Single brackets

Square brackets [ selects multiple elements of vector

  • by index: x[1] retrieves first element
  • by name: x["Bob"] retrieves the element named "Bob"
prices <- c(AAPL = 150, MSFT = 205, GOOGL = 250, AMZN = 303)

# Access by name attributes

prices[["AAPL"]] # single (double bracket)
[1] 150
prices[c("MSFT", "GOOGL")] # multiple (single bracket)
 MSFT GOOGL 
  205   250 

Elements can be accessed with index:

  • Positive integer: select
  • Negative integer: exclude
# Access by index
prices[1] # first element
prices[c(1, 3)] # 1st and 3rd element
prices[-c(1, 4)] # excluding first and fourth element
  • You can’t mix positive and negative index
prices[c(-1, 5)]
Error in prices[c(-1, 5)]: only 0's may be mixed with negative subscripts

Vector indexing: Double brackets

Use double brackets [[ on vector when you want to select single element (Scalar).

Tip

Style guide: while single bracket on vector still works, use [[ on vectors to reinforce your expectation.

# single value from an atomic vector
prices[1] # Ok (keeps names)
AAPL 
 150 
prices[[1]] # Better
[1] 150
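Another practical difference, shown as a quick sketch with the same prices vector (recreated so the snippet is self-contained): when the requested element does not exist, [ quietly returns NA, whereas [[ stops with an error. The try() wrapper is only there so a script keeps running after the error.

```r
prices <- c(AAPL = 150, MSFT = 205, GOOGL = 250, AMZN = 303)

prices[5]       # NA: single bracket silently returns a missing value
prices["TSLA"]  # NA: same for a name that does not exist

result <- try(prices[[5]], silent = TRUE)  # double bracket raises an error
inherits(result, "try-error")              # TRUE
```

Failing loudly is usually what you want when you expect exactly one element to be there.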

Subset & Assignment on Vector

Subsetting vector can be combined with assignment <- to modify selected values.

print(prices)
 AAPL  MSFT GOOGL  AMZN 
  150   205   250   303 
prices[[1]] <- 235 # use double bracket for single element selection
print(prices)
 AAPL  MSFT GOOGL  AMZN 
  235   205   250   303 
prices[["AMZN"]] <- 295
print(prices)
 AAPL  MSFT GOOGL  AMZN 
  235   205   250   295 

Assigning multiple elements:

# Use single bracket for multiple elements
prices[c(2,4)] <- 265 
print(prices)
 AAPL  MSFT GOOGL  AMZN 
  235   265   250   265 

Exercise

  1. Generate stock price vector:
stock_prices <- c("MSFT" = 200, "TSLA" = 255, "AAPL" = 230, "GOOG" = 215)
  2. Access the 1st and 3rd elements of stock_prices

  3. Assign 300 to “MSFT” and “AAPL”.

You can subset vector with logical vector inside brackets:

prices[prices > 230] # subset prices greater than 230
 AAPL  MSFT GOOGL  AMZN 
  235   265   250   265 
(prices %% 2) == 0 # a logical vector : TRUE if even number
 AAPL  MSFT GOOGL  AMZN 
FALSE FALSE  TRUE FALSE 
is_even <- (prices %% 2) == 0
prices[is_even] # subset of prices that are even numbers
GOOGL 
  250 
prices[(prices %% 2) == 0] # is the same as above
GOOGL 
  250 
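Logical tests can be combined with & and |, and which() reports the positions where a test is TRUE. A small sketch with a made-up vector (yields and its values are hypothetical):

```r
yields <- c(UST2Y = 4.1, UST5Y = 3.9, UST10Y = 4.3, UST30Y = 4.8)  # hypothetical values

yields[yields > 4 & yields < 4.5]  # both conditions must hold: UST2Y, UST10Y
yields[yields < 4 | yields > 4.5]  # either condition: UST5Y, UST30Y
which(yields > 4)                  # positions of TRUE: 1, 3, 4 (with names)
names(yields)[yields > 4]          # just the names of the matching elements
```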

Exercise

Generate below stock prices:

stock_prices <- c(TSLA = 680, FB = 355, NFLX = 540, NVDA = 800, AAPL = 145)

Subset and get the following:

  • Stocks whose prices are greater than $250
  • Stocks whose prices are even
  • Modify prices greater than $600 to $360

List Indexing

There are three ways to index lists, each with its own merits:

  • Single square brackets [ ] : returns original (list) type
  • Double square brackets [[ ]] : returns element’s type
  • Dollar sign operator $

A list is like a train carrying multiple cars:

x <- list(1:3, "a", 4:6)

List Indexing: Single brackets

Single bracket returns a list object, train.

x[1]
[[1]]
[1] 1 2 3

List Indexing: Double brackets

Double bracket returns element’s type, car.

x[[1]]
[1] 1 2 3

List Indexing: $ Operator

$ is a shorthand operator for double bracket [[ with a variable name.

  • Access variable without quotes (")
  • Autocompletion friendly
# They are (roughly) equivalent
x[["var1"]]
x$var1
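The "roughly" matters in one common case: $ takes the name literally, while [[ evaluates what is inside the brackets. A sketch with a named list (the names var1, var2, and key are made up for illustration):

```r
x <- list(var1 = 1:3, var2 = "a")

x[["var1"]]    # 1 2 3
x$var1         # 1 2 3

key <- "var1"  # the element name stored in another object
x[[key]]       # 1 2 3: [[ looks up the value of key
x$key          # NULL: $ looks for an element literally named "key"
```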

Portfolio example

Construct a list of portfolio:

portfolio <- list(
  stocks = c(AAPL = 150, TSLA = 680),
  bonds = c(TBOND = 1000),
  cash = 5000,
  brokerage = 'Robinhood'
)

Subset by number index:

portfolio[[1]] # double bracket: returns vector (element type)
portfolio[1] # single bracket: returns list

Subset by name:

portfolio[["bonds"]] # double bracket: returns vector
portfolio["bonds"] # single bracket: returns list

portfolio$bonds # dollar sign: returns vector

Assignment on Lists

You can use chained bracket operation and assignment:

portfolio[["stocks"]][["TSLA"]] <- 630 # replacement
portfolio[["bonds"]][["TBOND5Y"]] <- 210 # new value
print(portfolio)
$stocks
AAPL TSLA 
 150  630 

$bonds
  TBOND TBOND5Y 
   1000     210 

$cash
[1] 5000

$brokerage
[1] "Robinhood"

You can remove a component of list by assigning NULL

portfolio[["brokerage"]] <- NULL
print(portfolio)
$stocks
AAPL TSLA 
 150  630 

$bonds
  TBOND TBOND5Y 
   1000     210 

$cash
[1] 5000

Element removal on atomic vectors

Assigning NULL to an element of a vector doesn’t work:

prices <- c(100, 200, 300)
prices[[1]] <- NULL # Doesn't work
Error in prices[[1]] <- NULL: replacement has length zero

Use negative indexing and overwrite instead:

prices <- prices[-1] # except for first element
print(prices)
[1] 200 300

Exercise

Create a list object portfolio as:

portfolio <- list(
  stocks = c(AAPL = 150, TSLA = 680),
  bonds = c(TBOND = 1000),
  cash = 5000,
  brokerage = 'Robinhood'
)

Subset the portfolio to:

  1. Retrieve the stocks vector using index
  2. Retrieve the cash vector using its name
  3. Return a list of stocks and cash using their names
  4. Return a list containing only bonds
  5. Remove brokerage element from portfolio

R Functions

Functions

Everything that exists is an object.
Everything that happens is a function call.

John Chambers

Function calls

Function calls in R come in four varieties.

  1. prefix: function name comes before its arguments
# Prefix example
function_name(argument1 = value1, argument2 = value2, ...)
  2. infix: function name comes in between arguments
# Infix example
x + y
  3. replacement: function that replaces value by assignments
# Replacement example
names(stock_prices) <- c("AAPL", "BAC", "CA")
  4. special: built-in R syntax such as [[, for, and if, which do not have a consistent structure
# Special example
stock_prices[[1]]

Rewriting to prefix form

An interesting property of R is that every infix, replacement, and special form can be rewritten in prefix form.

  • use backtick ` to wrap the function symbol
1:5 # infix call
`:`(1, 5) # prefix call

x <- c(1, 2) 
names(x) <- c("A", "B") # replacement call
x <- `names<-`(x, c("C", "D")) # prefix call

x[[1]] # special call
`[[`(x, 1) # prefix call
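One reason this matters: because operators are ordinary functions, they can be passed to other functions. As a hedged sketch, sapply() (a base R function not covered above) applies a function to each element of its input; here the function is an operator written in backticks. The data is made up.

```r
pairs <- list(c(1, 2), c(3, 4), c(5, 6))  # hypothetical data

# Call `[[`(element, 2) on every element: take the second value of each
sapply(pairs, `[[`, 2)  # 2 4 6

# Same idea with `+`: add 10 to every element of a vector
sapply(1:3, `+`, 10)    # 11 12 13
```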

Our first prefix function call, c(), concatenates all the values and generates a single object.

  • c() has arbitrary number of arguments (...)
# c() function takes arbitrary number of arguments (... part)
c(1, 3, 5) 
c(1, 3, 5, 7, 9)

seq() function generates a sequence of numbers.

  • Its main arguments are from, to, and by.
# generate a vector of sequence from 1 to 10 by 2
seq(from = 1, to = 10, by = 2)
[1] 1 3 5 7 9
  • If users don’t specify the argument name, it reads input in order.
seq(1, 10, 2) # reads first three
[1] 1 3 5 7 9
  • If the user supplies more arguments than the function accepts, it raises an error.
seq(1, 10, 2, 5)
Error in seq.default(1, 10, 2, 5): too many arguments

Some functions have default argument values:

# max(..., na.rm = FALSE)
max(c(1,2,3,NA)) # NA
max(c(1,2,3,NA), na.rm = TRUE) # 3
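Named arguments can be given in any order, and naming them lets you skip over arguments you do not need. A short sketch with seq(), using its documented length.out argument:

```r
by_name <- seq(by = 2, to = 10, from = 1)  # 1 3 5 7 9: order does not matter when named
by_name

four_values <- seq(1, 10, length.out = 4)  # 1 4 7 10: skip `by`, ask for 4 values instead
four_values
```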

Defining a function

You can define a custom (user-defined) function in R with the following syntax:

# simple form
function_name <- function(x, y) function_body

# full form
function_name <- function(arg1, arg2) {
  # write what the function would do with arguments
  return() # output
}

The function can be called in prefix form:

function_name(arg1 = 1, arg2 = "hi")

Default argument value can be assigned by construction:

my_first_function <- function(argument1 = NULL, argument2 = "hi", arg3) {
  
  print(argument2)

}

Q: What would happen if user calls above function with

  • my_first_function()

Curly Braces

By default, R evaluates each line as an individual statement.

square_function <- function(x) x^2

Using curly braces { } allows you to group multiple expressions into a single unit that executes together.

some_other_function <- function(x) {
  squared <- x^2         
  result <- squared + 10  
  return(result)         
}
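If a function has no explicit return(), R returns the value of the last evaluated expression. The sketch below rewrites the function above in that style (the name some_other_function2 is made up):

```r
some_other_function2 <- function(x) {
  squared <- x^2
  squared + 10  # last expression: this value is returned
}

some_other_function2(3)  # 19
```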

Exercise

Prep: Load stringr package with library(stringr)

  1. Write your own function say_hello() that takes no argument. It prints “Hi!” when called.

  2. Now tweak the function to accept an argument, name. It prints “Hello, {name}!”

  • Use str_glue("Hello, {name}!")

Returning Values in Functions

return() can serve as an early exit from a function: any remaining code is not executed.

my_second_function <- function(a, b = 1) {
  
  c <- a + b  
  return(c)

  print(a)
}
my_second_function(a = 1, b = 2)
[1] 3
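Early exit is most useful together with a condition. A minimal sketch, assuming a hypothetical helper safe_ratio() that returns NA instead of dividing by zero (it expects single numbers, not longer vectors):

```r
safe_ratio <- function(numerator, denominator) {
  if (denominator == 0) {
    return(NA)  # exit early: the division below never runs
  }
  numerator / denominator
}

safe_ratio(10, 4)  # 2.5
safe_ratio(10, 0)  # NA
```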

Function Fundamentals

A function has three parts:

  1. The formals(): list of arguments

  2. The body(): code inside the function

  3. The environment(): where you defined the function

    • “GlobalEnv” : top-level workspace in the R session
some_function <- function(x, y = 1) {
  i <- 1
  ans <- x + y + i
  return(ans)
}

Check formals(), body() and environment():

# formals: list of arguments
formals(some_function) # Is i included? check.
# body: code inside of function
body(some_function)
# environment: where is the function object located
environment(some_function)

Some functions are found from external packages:

environment(stringr::str_glue) # glue function from stringr package

Getting Help on Functions

All R functions are built by someone, and documentation is typically provided.

For detailed description of any function, use ? followed by the function’s name.

For example, try below code in your console:

?seq

Or, use help()

help(seq)

Build a Perpetuity Calculator

The present value of a perpetuity, where the cash flow grows at a constant rate g, is given by:

\[ PV_{PER} = \frac{PMT}{r - g} \]

where

  • PMT is the payment or cash flow.
  • r is the discount rate.
  • g is the growth rate of the cash flow.

This formula applies when r > g.

Defining a function

You can design your perpetuity function in R with following syntax:

pv_per <- function(pmt, r, g) {
  # write what the function would do with arguments
  n <- pmt
  d <- r - g

  # then return the result
  return(n / d)
}

Calling a function

Let’s call the function above:

  • What is the PV of perpetuity, when PMT = $10,000, r = 7% and g = 3%?
pv_per(pmt = 10000, r = 0.07, g = 0.03)
[1] 250000
  • Assign the result value of the function
answer <- pv_per(10000, 0.07, 0.03)
print(answer)
[1] 250000
  • Vector can be the input (vectorized)
pv_per(c(100, 1000, 10000), 0.07, 0.03) # r, g recycled
[1]   2500  25000 250000
pv_per(c(100, 1000, 10000), c(0.07, 0.05, 0.04), 0.03) # g recycled
[1]    2500   50000 1000000

Exercise

  1. Define a perpetuity calculator function, pv_per(). What is the pv when PMT = $50,000, r = 4%, g = 0%?

  2. What is the pv when PMT = $50,000, r = 4%, but g are 1%, 2%, 3%?

Default Arguments

What happens if user doesn’t specify one argument?

pv_per(10000, 0.07) # g is missing!
Error in pv_per(10000, 0.07): argument "g" is missing, with no default

You can set default values for arguments, allowing them to be omitted when calling the function.

# set default g to be zero
pv_per2 <- function(pmt, r, g = 0) {
  n <- pmt
  d <- r - g
  # then return the result
  return(n / d)
}
# prefix call without g specified
pv_per2(10000, 0.07)
[1] 142857.1
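Since the formula only holds when r > g, a function can check that condition before calculating. One possible sketch, using base R's stopifnot(); the name pv_per_checked is made up, and this is not the only way to validate inputs:

```r
pv_per_checked <- function(pmt, r, g = 0) {
  # Stop with an error unless every discount rate exceeds its growth rate
  stopifnot(all(r > g))
  pmt / (r - g)
}

pv_per_checked(10000, 0.07, 0.03)  # 250000

# r < g violates the assumption; try() lets a script continue after the error
bad <- try(pv_per_checked(10000, 0.03, 0.07), silent = TRUE)
inherits(bad, "try-error")         # TRUE
```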

Example 2: Black-Scholes Pricing

Functions can do more complex calculations. Following the Black-Scholes call / put pricing formula, we can write a function as below:

bsm_price <- function(S0, K, r, T, sigma, type = "call") {
  d1 <- (log(S0 / K) + (r + 0.5 * sigma^2) * T) / (sigma * sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  
  if (type == "call") {
    return(S0 * pnorm(d1) - K * exp(-r * T) * pnorm(d2))
  } else if (type == "put") {
    return(K * exp(-r * T) * pnorm(-d2) - S0 * pnorm(-d1))
  } else {
    stop("Invalid option type. Use 'call' or 'put'.")
  }
}

Calculate price estimates with four scenarios:

S0 <- c(100, 105, 110, 115)   # Stock prices
K <- c(100, 100, 100, 100)    # Strike prices
r <- 0.05                # Risk-free rate
T <- c(1, 0.5, 2, 0.05)        # Time to maturity
sigma <- c(0.2, 0.25, 0.3, 0.4) # Volatility

bsm_price(S0, K, r, T, sigma) # vectorized operation!
[1] 10.45058 11.47739 28.31895 15.47719
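A useful sanity check for a Black-Scholes implementation is put-call parity, C - P = S0 - K * exp(-r * T), a standard identity for European options on a non-dividend-paying stock. The sketch below repeats the function from above so that it runs on its own, then compares both sides with all.equal(). The name maturity is used instead of T outside the function only to avoid overwriting R's shorthand for TRUE.

```r
bsm_price <- function(S0, K, r, T, sigma, type = "call") {
  d1 <- (log(S0 / K) + (r + 0.5 * sigma^2) * T) / (sigma * sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  if (type == "call") {
    S0 * pnorm(d1) - K * exp(-r * T) * pnorm(d2)
  } else if (type == "put") {
    K * exp(-r * T) * pnorm(-d2) - S0 * pnorm(-d1)
  } else {
    stop("Invalid option type. Use 'call' or 'put'.")
  }
}

S0 <- c(100, 105, 110, 115)      # stock prices
K <- 100                         # strike price (recycled)
r <- 0.05                        # risk-free rate
maturity <- c(1, 0.5, 2, 0.05)   # time to maturity
sigma <- c(0.2, 0.25, 0.3, 0.4)  # volatility

call_price <- bsm_price(S0, K, r, maturity, sigma, type = "call")
put_price  <- bsm_price(S0, K, r, maturity, sigma, type = "put")

# Left and right side of put-call parity should agree up to rounding error
all.equal(call_price - put_price, S0 - K * exp(-r * maturity))  # TRUE
```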

Anonymous function

# Anonymous function example
function(x, y) {
  return(x + y + 1)
}

Functions are typically named so they can be reused multiple times.

However, you can skip naming a custom function; such functions are called anonymous functions.

  • Useful when the function is simple and called only one time.

They are not stored as objects since they do not have assigned symbols (names).
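Two typical uses, sketched with base R: calling an anonymous function immediately, and passing one to another function such as sapply() (which applies a function to each element of its input and is not covered above). The numbers are arbitrary.

```r
# Define and call in one go: wrap the definition in parentheses
immediate <- (function(x, y) {
  return(x + y + 1)
})(1, 2)
immediate  # 4

# Pass an unnamed function as an argument
squares <- sapply(c(1, 2, 3), function(x) x^2)
squares    # 1 4 9
```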

Syntactic sugar: Function

Note

Syntactic sugar refers to a feature in programming that makes the code simple to read or write, without adding functionality.

(Anonymous) functions can be defined with syntactic sugar (concise expression):

\(x, y) x + y + 1

\(x, y) {
  x + y + 1
} # or with curly braces

Exercise

Convert below perpetuity function (pv_per) to anonymous function:

pv_per <- function(pmt, r, g) {
  return(pmt / (r - g))
}

Syntactic sugar: Pipe Operator

A Motivating example

Solve below math problem. Describe your steps. What was the first and the last step?

\[ \sqrt{(2+4)^2 - 3 * 4} = ? \]

  1. Do 2+4 and then square it, and save it in your memory
  2. Do 3*4 and then subtract it from previous, and update your memory
  3. and then square root the value
  • Similarly, code is often not written in the order in which it is calculated.

  • It is easier for us to read and write code in the order in which it is executed.

When we have composite function calls such as

f(g(h(k(x))))

The call sequence is x -> k() -> h() -> g() -> f().

It would be easier to read, write, and debug if we could write the code in calling order:

# Note: this is not real code!
x then
  call k() on above and then
  call h() on above output and then 
  call g() on above output and then
  call f() on above output

Pipe operator & Function Chain

This is where the pipe operator |> comes in handy in R.

The pipe operator does the “and then” job, and it can be written as:

x |>
  k() |>
  h() |>
  g() |>
  f() # Do not put pipe at the end!

Tip

Style guide: prefer |> over %>%. Use the shortcut Cmd (Ctrl) + Shift + M.

Sometimes you’ll see the %>% operator instead, which comes from an external package (magrittr), whereas |> is native to R (version 4.1 and later). To use %>%, load the package first with library(magrittr).
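
To make the function chain concrete, here is a small sketch using only base R functions (the input vector is made up, and R 4.1 or later is assumed for |>). The nested call and the piped call compute the same value; only the reading order differs:

```r
# Nested call: read from the inside out
nested <- round(sum(sqrt(c(1, 4, 9))), 1)

# Piped call: read from top to bottom (requires R >= 4.1 for |>)
piped <- c(1, 4, 9) |>
  sqrt() |>       # 1 2 3
  sum() |>        # 6
  round(1)        # extra arguments go after the piped value

print(piped)
identical(nested, piped) # TRUE
```

The piped value always becomes the first argument of the next function, which is why round(1) here means round(<previous result>, 1).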

Exercise (challenge!)

Solve \(\sqrt{2^3}\) using pipe operator.

  1. First, solve the above the procedural way
  • For square root, use the sqrt() function
  2. Next, solve it using the pipe operator.
  • Define a function named cube that computes x^3
  • The code should start with 2.

External packages

Packages are add-on libraries that extend the functionality of R.

  • They provide additional functions, datasets, and tools for various tasks
  • Can be easily installed with install.packages()
  • And loaded in R session with library()

Installing packages:

# Installs tidyverse package if not installed
# Once installed, you don't need to install again
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

Load packages: you need to load a package to use its functionality.

  • You only need to load it once per session
library(tidyverse)

Control Structure

Control Structure?

Control structure dictates which code gets executed and when.

  1. Conditional Statements:
    • if statements: Execute code if a condition is true.
    • else/else if statements: Execute code if the condition is false.
  2. Loops:
    • for loops: Repeat code block a specified number of times.
    • while loops: Continue executing code as long as a condition is true.
  3. Map (apply):
    • Map a function to each element of a collection without explicitly writing loops.

If-else

The basic form of if and if-else statement in R:

# Simple form
if (condition) true_action
if (condition) true_action else false_action

# Full form
if (condition) {
  true_action
} else {
  false_action
}
# Simple form
x1 <- if (TRUE) 1 else 2
# Full form
x2 <- if (FALSE) {
  1 
} else {
  2
}
print(c(x1, x2))
[1] 1 2

Example 1: if and else execute code based on logical conditions.

stock_price <- 115
if (stock_price > 110) {
  print("The stock price has increased significantly!")
}
[1] "The stock price has increased significantly!"

Example 2: If the condition is not met, nothing happens (the block is skipped).

stock_price <- 100
if (stock_price > 105) {
  print("The stock price has increased!")
}

Example 3: else if checks one more logical condition:

stock_price <- 115
if (stock_price > 120) {
  print("The stock price has surged!")
} else if (stock_price > 110) {
  print("The stock price has increased moderately.")
}
[1] "The stock price has increased moderately."

Example 4: There can be multiple else if

stock_price <- 108
if (stock_price > 120) {
  print("The stock price has surged!")
} else if (stock_price > 110) {
  print("The stock price has increased moderately.")
} else if (stock_price > 105) {
  print("The stock price has increased slightly.")
}
[1] "The stock price has increased slightly."

Example 5: else is executed when none of the preceding conditions is met.

stock_price <- 100
if (stock_price > 120) {
  print("The stock price has surged!")
} else if (stock_price > 110) {
  print("The stock price has increased moderately.")
} else if (stock_price > 100) {
  print("The stock price has increased slightly.")
} else {
  print("The stock price has decreased.")
}
[1] "The stock price has decreased."

Exercise

Write an if-else statement:

  • If PMT > 1000, add 10000 to PMT (i.e., PMT <- PMT + 10000)
  • Else if PMT > 500, add 100 to PMT
  • Else, set PMT to 0

What is the outcome of the above if-else statement if the initial PMT is 750?

For loops

For loops are used when code has to be iterated a specified number of times.

# Syntax
for (item in vector) {
  action
}
for (i in 1:5) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

If the for loop were written out explicitly:

# loop variable i 
i <- 1
print(i)
[1] 1
i <- 2
print(i)
[1] 2
i <- 3
print(i)
[1] 3
i <- 4
print(i)
[1] 4
i <- 5
print(i)
[1] 5

for assigns the item in the current environment, overwriting any existing variable with the same name.

i <- 100 # i exists before loop

for (i in 1:3) {}

print(i) # for loop overwrites
[1] 3

A for loop accesses the items of a vector one by one:

tickers <- c("AAPL", "BAC", "C", "DAL")
for (t in tickers) {
  print(t)
}
[1] "AAPL"
[1] "BAC"
[1] "C"
[1] "DAL"

To loop over the index of each element, use seq_along() on the vector.

tickers <- c("AAPL", "BAC", "C", "DAL")
seq_along(tickers) # converts to integer sequence
[1] 1 2 3 4
for (num in seq_along(tickers)) {
  print(num)
}
[1] 1
[1] 2
[1] 3
[1] 4
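
A short sketch of why the index is useful, and why seq_along() is a safer habit than 1:length(x). The labels vector and the empty vector are illustrative names, not part of the original example:

```r
tickers <- c("AAPL", "BAC", "C", "DAL")
labels <- vector("character", length = length(tickers)) # output container

for (i in seq_along(tickers)) {
  # i is the position, tickers[[i]] is the element at that position
  labels[[i]] <- paste0(i, ": ", tickers[[i]])
}
print(labels)

# With an empty vector, 1:length(x) counts down from 1 to 0,
# while seq_along(x) is empty, so a loop over it would never run
empty <- character(0)
print(1:length(empty))   # 1 0
print(seq_along(empty))  # integer(0)
```

A loop written with 1:length(empty) would run twice on an empty input, which is usually a bug.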

When looping over dates / times, the for loop strips the attributes, so each item prints as a plain number:

dates <- as.Date(c("2020-01-01", "2020-05-01"))
for (dat in dates) {
  print(dat)
}
[1] 18262
[1] 18383

As a workaround, index with seq_along() and [[.

dates <- as.Date(c("2020-01-01", "2020-05-01"))
for (i in seq_along(dates)){
  print(dates[[i]])
}
[1] "2020-01-01"
[1] "2020-05-01"

For loop: preallocation

Memory preallocation means creating the output object at its full size before the loop.

For example:

# preallocate empty numeric vector size of 10
output <- vector("numeric", length = 10)
print(output)
 [1] 0 0 0 0 0 0 0 0 0 0
for (i in 1:10) output[[i]] <- i**2 # same as i^2
print(output)
 [1]   1   4   9  16  25  36  49  64  81 100

Important tips when looping:

  1. Fill the output with bracket indexing [] instead of growing it with c()
  2. Preallocating the output container is strongly recommended.

Best Example

# Think about the type and length of output before loop
N <- 10
output <- vector(mode = "list", length = N) # Create empty list output container

for (n in 1:N) {
  # Good: use square brackets indexing
  output[[n]] <- n**2 
}
print(output)
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 9

[[4]]
[1] 16

[[5]]
[1] 25

[[6]]
[1] 36

[[7]]
[1] 49

[[8]]
[1] 64

[[9]]
[1] 81

[[10]]
[1] 100

Bad Example

N <- 10
output <- numeric() # Empty vector without length

for (n in 1:N) {
  # Bad: use c()
  output <- c(output, n**2) 
}
print(output)
 [1]   1   4   9  16  25  36  49  64  81 100

Good Example

If preallocation is cumbersome, use list() for the output container, then convert it to a vector if needed.

N <- 10
output <- list() # Empty list output container without length

for (n in 1:N) {
  # Good: use square brackets indexing
  output[[n]] <- n**2 # same as n^2
}
output <- as.numeric(output) # convert if needed
print(output)
 [1]   1   4   9  16  25  36  49  64  81 100

Research: Benchmarking

N <- 5000
list_prealloc <- vector("list", length = N)
list_noalloc <- list()
vector_prealloc <- vector("numeric", length = N)
vector_noalloc <- numeric()

bench::mark(
  list_prealloc_bracket = for (n in 1:N) {
    list_prealloc[[n]] <- n**2 
  },
  list_noalloc_bracket = for (n in 1:N) {
    list_noalloc[[n]] <- n**2
  }, 
  vector_prealloc_bracket = for (n in 1:N) {
    vector_prealloc[[n]] <- n**2
  },
  vector_noalloc_bracket = for (n in 1:N) {
    vector_noalloc[[n]] <- n**2
  },
  list_noalloc_c = for (n in 1:N) {
    list_noalloc <- c(list_noalloc, n**2)
  }, 
  vector_noalloc_c = for (n in 1:N) {
    vector_noalloc <- c(vector_noalloc, n**2)
  },
  list_prealloc_c = for (n in 1:N) {
    list_prealloc <- c(list_prealloc, n**2)
  },
  vector_prealloc_c = for (n in 1:N) {
    vector_prealloc <- c(vector_prealloc, n**2)
  },
  iterations = 5,
  check = FALSE
)
# A tibble: 8 × 6
  expression                   min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>              <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 list_prealloc_bracket      893µs    926µs   1053.      54.8KB      0  
2 list_noalloc_bracket       898µs    941µs   1025.     830.1KB      0  
3 vector_prealloc_bracket    809µs    895µs    695.      54.8KB    139. 
4 vector_noalloc_bracket     841µs    925µs   1079.     830.1KB      0  
5 list_noalloc_c             413ms    639ms      1.53   286.4MB     31.8
6 vector_noalloc_c           127ms    263ms      4.37   286.4MB     92.7
7 list_prealloc_c            357ms    639ms      1.54   286.4MB     32.6
8 vector_prealloc_c          124ms    220ms      4.35   286.4MB     93.1

Verdict

When performing loops:

  1. Preallocation + bracket indexing [] is the best.

  2. No preallocation is forgivable.

  3. Repeated use of c() is strongly discouraged.

next and break

Generally used with if-else condition tests inside a loop.

next skips the rest of the current iteration of a loop.

for (i in 1:5) {
  if (i == 3) {
    # skip if i == 3
    next
  }
  print(i)
}
[1] 1
[1] 2
[1] 4
[1] 5

break exits the loop immediately.

for (i in 1:5) {
  if (i == 4) {
    break
  }
  print(i)
}
[1] 1
[1] 2
[1] 3

For loop: Compound interest

How to calculate compound interest over multiple years using a for loop?

  • Principal: $10,000
  • Interest rate: 5%
  • Number of years: 10
principal <- 10000
rate <- 0.05
num_periods <- 10
# generate placeholder vector for the cashflow
cashflow <- vector("numeric", length = num_periods)

for (i in 1:num_periods) {
  value <- principal * (1 + rate)^i
  cashflow[[i]] <- value
}
print(cashflow)
 [1] 10500.00 11025.00 11576.25 12155.06 12762.82 13400.96 14071.00 14774.55
 [9] 15513.28 16288.95

Exercise

Based on the previous example, do the following:

  • Q1. Skip the first year using if and next

    • cashflow should keep its initial value (0) in the first slot
  • Q2. Stop the calculation if value exceeds $14,000

    • cashflow should keep its initial value (0) in the slots where value would exceed $14,000

While loops

A while loop begins by testing a condition and repeats the code as long as the condition is TRUE.

  • If not written properly, it can become an infinite loop.
count <- 0
while (count < 3) {
  print(count)
  count <- count + 1 # add 1
}
[1] 0
[1] 1
[1] 2
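
A while loop fits naturally when the number of iterations is not known in advance. As a sketch with made-up inputs (a $10,000 balance growing at 5% per year), the loop below counts how many years it takes for the balance to double:

```r
balance <- 10000   # starting principal (illustrative value)
rate <- 0.05
target <- 2 * balance
years <- 0

while (balance < target) {
  balance <- balance * (1 + rate) # grow one year
  years <- years + 1              # update the counter
}
print(years) # 15
```

The condition depends on balance, and balance changes in every iteration; that is what guarantees the loop eventually stops.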

Exercise

  1. Write code that prints 1 to 10 using a for loop.

  2. Achieve the same result using a while loop instead.

    • Start by defining a <- 1 outside of the loop
  3. Based on 2, tweak the code so that it skips printing the number 5.

    • Be careful not to get into an infinite loop!

Exercise 2

Write a function that checks the class of an input.

If the input is numeric, print “Numeric input!”, otherwise, print “Not numeric!”

  • Use inherits(x, "numeric") for the logical test.
# example outcome
my_function(c(1, 4, 5))
[1] "Numeric input!"
my_function(c('a', 'b', 'c'))
[1] "Not numeric!"

Function mapping

The map() function from purrr is an implicit loop over a function.

  • A function f is an input argument of map()
  • More succinct and easier to read than for loops
  • map() requires the tidyverse or purrr package

Note

Functions that take other functions as inputs are called functionals in R, like map().

Remember, though: if a vectorized operation is possible, avoid using for loops or map().

Example: map()

  • The output is always a list
plus_one <- function(x) {
  return(x + 1)
}
# need to import tidyverse to use map()
# library(tidyverse)
map(1:3, plus_one) # 1 2 3 is input for plus_one()
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

With a for loop, the code tends to be longer and requires preallocation.

# define empty list for output first
temp_list <- vector("list", length = 3)

for (i in 1:3) {
  # assign each to the list
  temp_list[[i]] <- plus_one(i)
}
print(temp_list)
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 4

Exercise

  1. Write a times_two() function that multiplies its input by 2.

  2. Map the times_two function over 1:10.

  3. Achieve the same result with a for loop.

map function 2

If the desired output is not a list but an atomic vector, use:

  • map_dbl() for a numeric (double) vector
  • map_chr() for a character vector
  • map_lgl() for a logical vector
  • map_int() for an integer vector
map(1:3, plus_one) |> class()
[1] "list"
map_dbl(1:3, plus_one) |> class()
[1] "numeric"
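
For comparison, base R ships its own functionals. The sketch below is not purrr; it uses base R's lapply() and vapply() as rough analogues of map() and map_dbl(), so it runs without loading any package:

```r
plus_one <- function(x) {
  return(x + 1)
}

# lapply() always returns a list, like map()
as_list <- lapply(1:3, plus_one)
class(as_list) # "list"

# vapply() returns an atomic vector and checks each result against a template,
# here "one double per element", similar in spirit to map_dbl()
as_dbl <- vapply(1:3, plus_one, FUN.VALUE = numeric(1))
print(as_dbl) # 2 3 4
```

The typed variants (map_dbl() or vapply()) fail loudly when a result does not match the expected type, which is often preferable to silently getting a list back.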

Vectorized Operation and Loops

Most functions and operators in R are vectorized by default.

  • Intuitive and fast: easier to read and write
  • R is built with those operations in mind
  • Avoid using for loops or map() if vectorization is possible
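
As one illustration, the compound-interest loop from earlier in this module can be rewritten as a single vectorized expression. This is a sketch reusing the same inputs (principal, rate, num_periods); the names cashflow_loop and cashflow_vec are introduced here only to compare the two results:

```r
principal <- 10000
rate <- 0.05
num_periods <- 10

# Loop version: one period at a time
cashflow_loop <- vector("numeric", length = num_periods)
for (i in 1:num_periods) {
  cashflow_loop[[i]] <- principal * (1 + rate)^i
}

# Vectorized version: ^ and * work on the whole vector 1:num_periods at once
cashflow_vec <- principal * (1 + rate)^(1:num_periods)

all.equal(cashflow_loop, cashflow_vec) # TRUE
```

The vectorized line needs no container, no index, and no loop variable.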

Example: portfolio value

stock_prices <- c(150, 250, 100)
shares_held <- c(10, 5, 20)
# multiplication is vectorized
portfolio_value <- stock_prices * shares_held
print(portfolio_value) # easier and faster
[1] 1500 1250 2000

A for loop approach:

N <- length(stock_prices) # length 3
portfolio_value <- vector("numeric", N) # container

for (i in 1:N) {
  portfolio_value[[i]] <- stock_prices[[i]] * shares_held[[i]]
}
print(portfolio_value)
[1] 1500 1250 2000

A map approach:

  • Map the * function over two input vectors (price, shares)
  • Use map2() for this case; see ?map2 for more info

portfolio_value <- map2(stock_prices, shares_held, `*`)
print(portfolio_value)
[[1]]
[1] 1500

[[2]]
[1] 1250

[[3]]
[1] 2000

Benchmark comparison

bench::mark(
  vectorizing = {portfolio_value <- stock_prices * shares_held},
  map2 = {portfolio_value <- map2(stock_prices, shares_held, `*`)},
  for_loop_prealloc = {
    portfolio_value <- vector("numeric", N) # container
    for (i in 1:N) {
      portfolio_value[[i]] <- stock_prices[[i]] * shares_held[[i]]
    }},
  iterations = 100,
  check = FALSE
)
# A tibble: 3 × 6
  expression             min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 vectorizing           41ns     82ns  4533503.        0B      0  
2 map2                  46µs  53.51µs    18094.      264B      0  
3 for_loop_prealloc    887µs   1.01ms      983.    20.1KB     30.4

Exercise

stock_prices <- c(150, 250, 100, 250, 300)
shares_held <- c(10, 5, 20, 5, 10)

Generate portfolio value of each asset, using:

  1. Vectorized multiplication (vector output)
  2. for loop
  3. map (implicit loop)

Vectorized if-else

The ifelse() function is a vectorized if-else statement.

  • Useful when you have a vector of TRUE / FALSE condition tests
  • No need to loop over each element of the vector
ifelse(test, yes, no)

Example: Dividend Payments

stock_prices <- c(45, 60, 52, 48, 55, 49)

dividends <- ifelse(
  stock_prices > 50, # test
  stock_prices * 0.02, # yes
  stock_prices * 0.015 # no
)

print(dividends)
[1] 0.675 1.200 1.040 0.720 1.100 0.735

Exercise

stock_prices <- c(45, 60, 52, 48, 55, 49)

Practice ifelse():

  • if stock price is greater than 53, assign “Bull”
  • otherwise “Bear”
  • assign it to sentiment object.

Vectorized if-else 2

case_when() from the tidyverse (dplyr package) is a generalized vectorized if-else.

library(tidyverse)
stock_prices <- c(45, 60, 52, 48, 55, 49)

sentiment2 <- case_when(
  stock_prices > 55 ~ "Bull", # if TRUE
  stock_prices > 53 ~ "Normal", # else if TRUE
  stock_prices > 49 ~ "Weak", # else if TRUE
  .default = "Bear" # else
)
print(sentiment2)
[1] "Bear"   "Bull"   "Weak"   "Bear"   "Normal" "Bear"  
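
To see what case_when() saves you from, here is a sketch of the same classification written with nested base R ifelse() calls (the name sentiment_nested is introduced only for this comparison). The logic is identical, but each extra category adds another level of nesting:

```r
stock_prices <- c(45, 60, 52, 48, 55, 49)

# Conditions are checked in order; the first TRUE one decides the label
sentiment_nested <- ifelse(
  stock_prices > 55, "Bull",
  ifelse(
    stock_prices > 53, "Normal",
    ifelse(stock_prices > 49, "Weak", "Bear")
  )
)
print(sentiment_nested)
```

In both versions the order of the conditions matters: a price of 60 also satisfies "> 53" and "> 49", but it is labeled by the first condition it meets.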

Exercise 2

stock_prices <- c(45, 60, 52, 48, 55, 49)

Practice case_when():

  • if stock price is greater than 58, assign “Bull”
  • if stock price is greater than 50, assign “Normal”
  • otherwise “Bear”
  • assign it to sentiment2 object.

File Systems in R

Package fs

The fs package provides a simple, consistent, cross-platform way to handle:

  • Path operations
  • File and directory control
  • File information
# Install package
if (!requireNamespace("fs", quietly = TRUE)) {
  install.packages("fs")
}

What is a Path?

A path is a string of characters used to uniquely identify a file or folder in a file system.

Types of paths:

  • Absolute path: exact location of a file or directory from the root.
  • Relative path: location relative to the current working directory.

Working Directory

The working directory is the location from which the program (R, bash, Python, etc.) is running.

  • getwd() shows the current working directory.
  • setwd("/path/to/directory") changes working directory to specified path.

Absolute Paths

  • Begins from the root directory (/ in Mac/Linux, C:\ in Windows)
  • Example (Mac/Linux): /Users/username/Documents/project/data.csv
  • Example (Windows): C:\Users\username\Documents\project\data.csv

Absolute paths are unambiguous.

Relative Paths

  • Path that is relative to the current working directory.
  • Example: ./data/project/data.csv (The . denotes the current directory)
  • Succinct, and makes paths easier to manage in projects

Directory references

.: the current directory.

  • From /Users/john/projects, ./data refers to /Users/john/projects/data.

..: the parent directory; one level up from the current directory.

  • From /Users/john/projects, ../data refers to /Users/john/data.

~: the home directory.

  • The default directory for the user in the operating system
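
The three references can be checked from the R console. This sketch uses base R only (normalizePath(), path.expand(), file.path()); the printed paths depend on your machine and operating system, so no output is shown:

```r
wd <- getwd()                      # absolute path of the working directory
here <- normalizePath(".")         # "." resolves to the same location
parent <- normalizePath("..")      # ".." resolves to the parent directory
home <- path.expand("~")           # "~" expands to the home directory

print(c(wd = wd, here = here, parent = parent, home = home))

# file.path() joins path pieces with "/" so you do not paste strings by hand
rel_path <- file.path(".", "data", "project", "data.csv")
print(rel_path) # "./data/project/data.csv"
```

Building paths with file.path() (or the fs package) instead of typing separators by hand helps keep scripts portable between Windows and Mac/Linux.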

Home Directory

The “default” directory for a user in the operating system.

  • Mac: /Users/<username> (on Linux, typically /home/<username>). Example: if your username is john, your home directory would be /Users/john.

  • Windows: C:\Users\<username>. Example: if your username is john, your home directory would be C:\Users\john.

Tilde ~

Represents the user’s home directory.

Example: ~/cases refers to the cases folder in the user’s home directory:

  • C:\Users\<username>\cases on Windows
  • /Users/<username>/cases on Mac

Creating File/Directory

Creating and deleting files and directories is simple:

# Create file and folder in the working directory
library(fs)
file_create("my_Rscript.R")
dir_create("my_folder")
file_delete("my_Rscript.R")
dir_delete("my_folder")

List files and directories

List files and directories:

dir_ls(".") # list files in current directory

It’s especially useful with globbing / regex:

dir_ls(glob = "*.txt") # list txt files

Exercise

  1. Create an R script file named fs_exercise.R in your working directory.
# Check your working directory with
getwd()
  2. List all files that have the .R file extension.

  3. What is the absolute path of the script file?

  4. From your home directory (~), what is the relative path of the script file?

Text Data Files

A plain, human-readable text data file, delimited by a specific character

  • Comma-Separated Values (CSV) with ,
  • Tab-Separated Values (TSV) with \t
  • Since it is text, R tries to “guess” the correct data type of each column when importing

Text Data Example

A text data file typically looks like:

  • Usually the first line is a header (column names)
  • Data values separated by a delimiter (e.g., , for CSV, \t for TSV)
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
# TSV
iris |> head(3) |> format_tsv() |> cat()
Sepal.Length    Sepal.Width Petal.Length    Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3   1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa

Text Data files

Many packages support writing/reading csv/tsv files:

  • base R (utils package): basic, slow
  • readr from tidyverse: extremely fast, functional
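
For reference, here is a sketch of the base R (utils) route using write.csv() and read.csv() on a temporary file. It also shows the type "guessing" in action: the Species column is a factor in iris but comes back as plain text, assuming R 4.0 or later where strings are no longer converted to factors by default:

```r
csv_path <- tempfile(fileext = ".csv") # temporary file, so nothing is left behind

write.csv(iris, csv_path, row.names = FALSE) # write without row numbers
iris_back <- read.csv(csv_path)              # read it back as a data.frame

dim(iris_back)            # 150 rows, 5 columns
class(iris$Species)       # "factor" in the original data
class(iris_back$Species)  # "character" after the round trip (R >= 4.0 default)

unlink(csv_path) # delete the temporary file
```

A text file stores only characters, so information such as factor levels is lost unless you restore it after reading.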

Write CSV / TSV

To write a data.frame to a .csv / .tsv file: write_csv(), write_tsv()

# Saves csv / tsv file on current working directory
write_csv(iris, "iris_file.csv")
write_tsv(iris, "iris_file.tsv")

To read a .csv / .tsv file to a data.frame: read_csv(), read_tsv()

my_iris <- read_csv('iris_file.csv')
my_iris2 <- read_tsv("iris_file.tsv")

Other data formats

There are other common data formats:

  • “.xlsx”: Excel spreadsheets
  • “.json”: JavaScript Object Notation (NoSQL)
  • “.parquet”: columnar big data storage
  • “.sas7bdat”, “.dta”: SAS and Stata data files

R Data frame (and tibble) class

Data frames

One of the most important data classes in R, built on top of the list type.

  • Stores data in a 2D tabular form:

    • with rows (observations, or records)

    • and columns (variables)

  • Columns can be different types!

Create data frame

Creating a data.frame is almost identical to creating a list.

# creating data.frame is similar to creating a list
my_dataframe <- data.frame(
  a = c(1, 2, 3),
  b = c('a', 'b', 'c'),
  c = c(TRUE, FALSE, FALSE)
)
print(my_dataframe)
  a b     c
1 1 a  TRUE
2 2 b FALSE
3 3 c FALSE
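
Because a data frame is a list of equal-length columns, the usual list tools work on it. A quick sketch that inspects the data frame from above (column classes shown assume R 4.0 or later, where character columns are not converted to factors):

```r
my_dataframe <- data.frame(
  a = c(1, 2, 3),
  b = c("a", "b", "c"),
  c = c(TRUE, FALSE, FALSE)
)

is.list(my_dataframe)        # TRUE: a data frame is a list of columns
length(my_dataframe)         # 3: the number of columns (list elements)
dim(my_dataframe)            # 3 rows, 3 columns
sapply(my_dataframe, class)  # class of each column
```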

Exercise

Create a data frame named housing:

  • 6 columns: Name, Age, Sex, Income, Housing, Zipcode
    • Name: Amy, Bill, Charles, Donna, Eckert
    • Age: 21, 25, 30, 38, 49
    • Sex: Female, Male, Male, Female, Male
    • Income: 36000, 53000, 89000, 82000, 166000
    • Housing: Rent, Rent, Own, Own, Rent
    • Zipcode: 12333, 12543, 11255, 12333, 33533

What type (class) does R automatically assign to each column?

  • Check with str(housing).

Q: What should be their type (class) in theory?

Tibble class

Essentially the same as the data frame class, with some fixes:

  • Fixes old inconsistencies in R data.frame class
  • Safer executions
  • Better console displays

as_tibble() converts data.frame class to tibble class.
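
One example of such an inconsistency, sketched with the built-in iris data frame: selecting a single column with [ silently turns a base data.frame into a vector, while selecting two columns does not. A tibble's [ keeps the result as a tibble instead.

```r
col_vec <- iris[, 1]                 # one column: silently becomes a vector
col_df  <- iris[, 1, drop = FALSE]   # drop = FALSE keeps a data.frame
two_col <- iris[, 1:2]               # two columns: stays a data.frame

class(col_vec) # "numeric"
class(col_df)  # "data.frame"
class(two_col) # "data.frame"
```

Code that expects a table but receives a vector can fail in confusing ways, which is why the more predictable tibble behavior is considered safer.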

Example

A toy dataset, iris dataframe:

head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Class of iris:

class(iris)
[1] "data.frame"

Convert iris to tibble class:

  • Prints the dimension
  • Prints data class by column
iris_tb <- as_tibble(iris)
head(iris_tb, 3)
# A tibble: 3 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 

iris_tb is a multi-class object that is both tibble and dataframe.

class(iris_tb)
[1] "tbl_df"     "tbl"        "data.frame"

Access operations

As it is built on lists, [, [[, $ also work on data frames.

  • bracket operations align more consistently with tibble class

A data frame can be subsetted with df[i, j]

  • i part operates on row (called filtering)
  • j part selects columns

Bracket [, [[ subsetting

iris_tb[, 1] # vector? tibble?
iris_tb[1, 1] # vector? tibble?
iris_tb[1,] # vector? tibble?

1st row, 1st column, returned in the element’s class (numeric vector)

iris_tb[[1, 1]]
[1] 5.1

Caution

Double bracket [[ works with a comma only when both the row and the column are given. That is, iris_tb[[1,1]] works, but iris_tb[[,1]] doesn’t.

To pull a whole column in its element (vector) class, you’ll learn pull() later.

Other examples:

# Exercise: try!
iris_tb[, c("Sepal.Length", "Petal.Width")]
iris_tb[1:5, 1:2]
iris_tb[iris_tb[["Sepal.Length"]] > 3.1, 1] # filtering

If no comma is provided, the index is treated as a column index.

# try and see the difference!
iris_tb[1]
iris_tb["Sepal.Length"]
iris_tb[["Sepal.Length"]]

$ subsetting

$ pulls a single column from a data frame (tibble) in the element’s class (a vector).

iris_tb$Sepal.Length

Exercise

  1. Generate iris_tb by converting iris using as_tibble().

  2. Try all subsetting methods on the 2nd column:

  • Use single bracket and integer index iris_tb[2]
  • Use double bracket and integer index iris_tb[[2]]
  • Use single bracket and column name iris_tb[colname]
  • Use double bracket and column name iris_tb[[colname]]
  • Use $
  3. Filter rows with “Sepal.Length > 5”. How many rows do you observe?
  • Use nrow() to check the number of rows.

Chained subset call

Subsetting can be chained

  • Pause and take a look at example. What is happening?
iris_tb[21:100, c('Petal.Length', 'Sepal.Length')][25, ]
  • Confirm with iris_tb[45, c("Petal.Length", "Sepal.Length")]

Exercise

On iris_tb,

  • Filter with “Petal.Length > 4.5”
  • Select columns “Species”, “Petal.Length”
  • then slice rows from 1 to 10
  • Store this as filtered_iris

Q. Confirm the average of Petal.Length from filtered_iris is 4.74.

  • Use mean(dataframe$column).

Assign and remove column

Columns are assigned and removed the same way as list elements. To assign a new variable within the data.frame, use:

data.frame["NEW_COLUMN_NAME"] <- NEW_COLUMN
data.frame[["NEW_COLUMN_NAME"]] <- NEW_COLUMN # same

# EXAMPLE
iris_tb["Petal.Area"] <- iris$Petal.Length * iris$Petal.Width

To remove a variable from the data.frame, use:

data.frame["EXISTING_NAME"] <- NULL
data.frame[["EXISTING_NAME"]] <- NULL
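
A small end-to-end sketch on a made-up data frame (the column names price, shares, and value are illustrative), adding one column and then removing another:

```r
df <- data.frame(
  price = c(150, 250, 100),
  shares = c(10, 5, 20)
)

df[["value"]] <- df[["price"]] * df[["shares"]] # add a new column
names(df) # "price" "shares" "value"

df[["shares"]] <- NULL # remove an existing column
names(df) # "price" "value"
```

Assigning NULL deletes the column in place; the remaining columns keep their order.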

Modern syntax: dplyr package

R package for dataframe manipulation tasks.

  • A grammar of data manipulation
  • Replaces the use of [, [[, $ in most cases
  • Intuitive and easy to understand
  • Fast, written in C++

Prep: Company Financials Data

The Company_financials.csv data will be available in our GitHub class repository.

  • Option 1: Git pull and copy the data to your class working folder
# install.packages("fs")
# Adjust the address to your setting
fs::file_copy(
  "Class_repo/FIN4770_Spring2025/Data/Company_Financials.csv", # from
  "Company_Financials.csv" # to
  )
  • Option 2: Read the file directly from its URL, as below
library(tidyverse)
url <- "https://raw.githubusercontent.com/matthewgson/FIN4770_Spring2025/refs/heads/main/Data/Company_financials.csv"
fin_data <- read_csv(url)

Data Overview

  • Balance Sheet items: Assets, Liabilities, Equity, etc
  • Year: reporting year
  • Industry: Industry classification for each company
  • Company: Ticker symbol

The select() verb

select() lets you choose specific columns.

  • by column names
  • by index
  • by helper functions (starts_with, ends_with, etc)
fin_data |> 
  select(ticker, Industry)  # column names
fin_data |> 
  select(1:5) # index

Suppose you want to select all columns whose names start with “current”.

fin_data |> 
  select(ticker, Industry, year, starts_with("current"))

or end with “liabilities”.

fin_data |> 
  select(ticker, Industry, year, ends_with("liabilities"))

or contain “asset”

fin_data |> 
  select(ticker, Industry, year, contains("asset"))

The relocate() verb

relocate() is used to change the order of columns.

fin_data |> 
  relocate(market_cap) # move to left-most

fin_data |> 
  relocate(market_cap, .before = year) # move before year

fin_data |> 
  relocate(market_cap, .after = last_col()) # move to far-right

The rename() verb

rename() changes the column names.

fin_data |> 
  rename(
    Ticker = ticker, # left is new name
    Year = year
  )

The pull() verb

pull() extracts a single column as a vector.

fin_data |> 
  pull(ticker) # vector

fin_data |> 
  select(ticker) # tibble

Exercise

Using the dataset fin_data:

  1. Create a new tibble that includes only “ticker”, “Industry”, “year”, “market_cap”, and the columns that start with “current”.

  2. Relocate column “market_cap” as the first column.

  3. Rename the “market_cap” column to “Market_Cap”.

  4. Pull “ticker” as a vector from fin_data.

  5. Combine 1 to 4 with a pipe chain to achieve all at once.

The filter() verb

filter() lets you choose rows based on conditions.

# Single filter
fin_data |> 
  filter(year >= 2022)

# Multiple filter
fin_data |> 
  filter(
    year >= 2022,
    Industry == "Financials"
  )

fin_data |> 
  filter(
    between(market_cap, 1e11, 1e12), # between 100B to 1T
    Industry %in% c("Financials", "Energy")
  )

Apply filters across columns: if_any() and if_all()

Imagine your dataset includes multiple balance-sheet columns (e.g., current assets and current liabilities). You want to filter rows where any of those values exceeds $100B (1e11).

fin_data |>
  filter(
    if_any(
      c(current_assets, current_liabilities), \(x) x > 1e11))

Use if_all() for strict filtering (every listed column must satisfy the condition):

fin_data |> 
  filter(
    if_all(
      c(current_assets, current_liabilities), \(x) x > 1e11))
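
The difference is easiest to see on a tiny table. The sketch below uses made-up numbers in a toy data frame (toy, any_rows, and all_rows are illustrative names; only the column names mirror fin_data) and assumes dplyr is installed:

```r
library(dplyr)

# Made-up toy data; column names mirror the ones used on the slides
toy <- data.frame(
  ticker = c("AAA", "BBB", "CCC"),
  current_assets = c(2e11, 5e10, 3e11),
  current_liabilities = c(5e10, 2e10, 2e11)
)

# if_any(): keep a row when AT LEAST ONE listed column passes the test
any_rows <- toy |>
  filter(if_any(c(current_assets, current_liabilities), \(x) x > 1e11))

# if_all(): keep a row only when EVERY listed column passes the test
all_rows <- toy |>
  filter(if_all(c(current_assets, current_liabilities), \(x) x > 1e11))

any_rows$ticker # "AAA" "CCC"
all_rows$ticker # "CCC"
```

AAA passes if_any() because its assets exceed 1e11, but fails if_all() because its liabilities do not.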

The slice() verb

slice() extracts rows based on their positions.

fin_data |> 
  slice(1:5) 

fin_data |> 
  slice(-(1:10)) # drop

fin_data |> 
  slice_sample(n = 10) # random

fin_data |> 
  slice_max(current_assets, n = 3) # top 3 values

The distinct() verb

distinct() removes duplicate rows based on referred columns.

fin_data |> 
  distinct(Industry) # distinct values of Industry

fin_data |> 
  distinct(Industry, .keep_all = TRUE) # keeps other columns

Exercise

From fin_data:

  1. Filter to keep only rows where year is greater than 2022 and Industry is “Financials”.

  2. Filter rows where any of the columns containing “asset” exceeds $100B (1e11).

  3. Filter rows where all of the columns containing “current” exceed $10B (1e10).

  4. Slice the first 3 rows of the data.

  5. Show the distinct values of “Industry” in the data, and keep the other columns.

The arrange() verb

arrange() reorders the rows by one or more columns.

fin_data |> 
  arrange(current_assets) # ascending by default

fin_data |> 
  arrange(desc(current_assets)) # descending

fin_data |> 
  arrange(ticker, year, current_assets) # hierarchical ordering

Caution

Best practice: DO NOT CHAIN ARRANGE - a later arrange() overrides the ordering set by the earlier one.

# Wrong
fin_data |> 
  arrange(ticker) |> 
  arrange(current_assets)

# Correct
fin_data |> 
  arrange(ticker, current_assets)
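
A sketch on made-up data (toy, chained, and combined are illustrative names; dplyr is assumed to be installed) shows what goes wrong when arrange() calls are chained:

```r
library(dplyr)

# Made-up toy data with two tickers
toy <- data.frame(
  ticker = c("B", "A", "B", "A"),
  current_assets = c(1, 4, 3, 2)
)

# Chained: the second arrange() re-sorts everything by current_assets
chained <- toy |>
  arrange(ticker) |>
  arrange(current_assets)

# Combined: sort by ticker first, then by current_assets within each ticker
combined <- toy |>
  arrange(ticker, current_assets)

chained$ticker          # "B" "A" "B" "A": the ticker ordering is gone
combined$ticker         # "A" "A" "B" "B"
combined$current_assets # 2 4 1 3
```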

The mutate() verb

mutate() lets you create or modify columns.

fin_data |> 
  mutate(
    debt_asset_ratio = current_debt / current_assets,
    equity = total_assets - total_debt
  )

Using general if-else with case_when() to classify:

fin_data |> 
  mutate(
    debt_asset_ratio = current_debt / current_assets,
    Leverage_category = case_when(
      debt_asset_ratio >= 0.8 ~ "High Leverage",
      debt_asset_ratio >= 0.5 ~ "Moderate Leverage",
      TRUE ~ "Low Leverage"
    )
  )

Exercise

From fin_data:

  1. Arrange the data by ticker (ascending) and year.

  2. Create a new variable “debt_to_asset_ratio” as the ratio of current_debt to current_assets.

The summarize() verb

summarize() computes statistics for the entire dataset.

fin_data |> 
  summarize(
    avg_current_assets = mean(current_assets, na.rm = TRUE),
    max_assets = max(current_assets, na.rm = TRUE)
  )

You can summarize by groups:

fin_data |> 
  group_by(ticker) |> 
  summarize(
    avg_current_assets = mean(current_assets, na.rm = TRUE)
  ) |> 
  ungroup() # explicit ungroup; summarize() already drops the last grouping level

Or simply use .by in summarize()

fin_data |> 
  summarize(
    avg_current_assets = mean(current_assets, na.rm = TRUE),
    .by = ticker
  ) # ungrouped after summary

Lab Problem: Year-over-Year Growth Calculation

From fin_data:

  1. Arrange the dataset by ticker and year in ascending order. Then, group the data by ticker.

  2. Use mutate() along with the lag() function to calculate the year-over-year growth rate for current_assets. Name the variable as yearly_asset_growth.

\(\frac{\mathrm{current\_assets} - \mathrm{lag(current\_assets)}}{\mathrm{lag(current\_assets)}}\)

  3. Summarize the average yearly growth rate for each ticker.

dplyr grammar summary

Key verbs

  1. select() : select a subset of columns
  • rename() : rename columns
  • relocate() : change column positions
  • pull() : extract a single column as a vector
  2. filter() : select a subset of rows by condition
  • slice() : extract specific rows
  • distinct() : remove duplicate rows
  3. arrange() : reorder rows

  4. mutate() : add new columns (variables)

  5. summarize() : generate a summary table

  • group_by() / ungroup()

Portfolio Sorting with Crypto

A mini finance project

Incorporating AI for Coding

From now on, I’ll introduce how to leverage AI for coding.

  • Generating code snippets

  • Troubleshoot and debug

  • Best Practices

Sample GenAIs for code

  1. ChatGPT
  2. Grok
  3. Claude.ai (limited)
  4. Meta.ai
  5. Gemini

Crypto Analysis with dplyr and tidyverse

Learn to:

  • Manipulate financial data
  • Calculate average returns and volatility
  • Sort cryptocurrencies into portfolios
  • Compare performance of different portfolios
  • Visualize performance

Let’s optimize our crypto investments!

Prep Required Libraries

# Install yourself before loading
library(tidyquant) 
library(tidyverse)

Our list of cryptos: a sample of 9

crypto_coins <- c("BTC-USD", "ETH-USD", "BNB-USD", 
                  "SOL-USD", "XRP-USD", "DOT-USD", 
                  "DOGE-USD", "MATIC-USD", "ADA-USD")

Get data

# Get historical prices for the selected coins
crypto_data <- tq_get(crypto_coins,
                      from = "2020-01-01",
                      to = "2023-01-01") # 3 years

How many observations are found for each crypto?

# Not all coins have the same number of observations
crypto_data |> 
  summarize(n(), .by = symbol) 
# A tibble: 9 × 2
  symbol    `n()`
  <chr>     <int>
1 BTC-USD    1097
2 ETH-USD    1097
3 BNB-USD    1097
4 SOL-USD     997
5 XRP-USD    1097
6 DOT-USD     865
7 DOGE-USD   1097
8 MATIC-USD  1097
9 ADA-USD    1097

Calculate Returns

Calculate daily returns with arrange(), group_by() and mutate()

crypto_data <- crypto_data |> 
  arrange(symbol, date) |> 
  group_by(symbol) |> 
  mutate(daily_ret = adjusted / lag(adjusted) - 1) |> # arithmetic return
  ungroup()

crypto_data |> head()
# A tibble: 6 × 9
  symbol  date         open   high    low  close   volume adjusted daily_ret
  <chr>   <date>      <dbl>  <dbl>  <dbl>  <dbl>    <dbl>    <dbl>     <dbl>
1 ADA-USD 2020-01-01 0.0328 0.0338 0.0327 0.0335 22948374   0.0335  NA      
2 ADA-USD 2020-01-02 0.0335 0.0335 0.0324 0.0328 20843934   0.0328  -0.0211 
3 ADA-USD 2020-01-03 0.0327 0.0344 0.0325 0.0342 30162644   0.0342   0.0436 
4 ADA-USD 2020-01-04 0.0342 0.0347 0.0339 0.0346 29535781   0.0346   0.0121 
5 ADA-USD 2020-01-05 0.0346 0.0354 0.0345 0.0347 21479178   0.0347   0.00364
6 ADA-USD 2020-01-06 0.0348 0.0373 0.0347 0.0373 37988444   0.0373   0.0735 

Performance metrics

Calculate performance metrics with group_by() and summarize()

performance_metrics <- crypto_data |> 
  group_by(symbol) |> 
  summarize(
    avg_daily_ret = mean(daily_ret, na.rm = TRUE), 
    vol_daily = sd(daily_ret, na.rm = TRUE)
  )
performance_metrics
# A tibble: 9 × 3
  symbol    avg_daily_ret vol_daily
  <chr>             <dbl>     <dbl>
1 ADA-USD         0.00356    0.0588
2 BNB-USD         0.00421    0.0570
3 BTC-USD         0.00150    0.0379
4 DOGE-USD        0.00807    0.135 
5 DOT-USD         0.00270    0.0678
6 ETH-USD         0.00333    0.0505
7 MATIC-USD       0.00668    0.0805
8 SOL-USD         0.00548    0.0793
9 XRP-USD         0.00247    0.0634

Visualize metrics

Visualize performance metrics with ggplot(), to generate barplot:

performance_metrics |> 
  ggplot(
    aes(x = fct_reorder(symbol, -avg_daily_ret), y = avg_daily_ret, fill = symbol )
  ) +
  geom_col() + 
  scale_y_continuous(labels = scales::percent_format())+
  labs(
    title = "Average Daily Return of Cryptos",
    subtitle = "Year 2020 - 2022",
    caption = "Data: Yahoo Finance",
    x = "Crypto",
    y = "Average Return",
    fill = "Symbol"
  ) +
  theme_minimal()

Similarly for volatility:

performance_metrics |> 
  ggplot(
    aes(x = fct_reorder(symbol, vol_daily), y = vol_daily, fill = symbol )
  ) +
  geom_col() + 
  scale_y_continuous(labels = scales::percent_format())+
  labs(
    title = "Daily Volatility of Cryptos",
    subtitle = "Year 2020 - 2022",
    caption = "Data: Yahoo Finance",
    x = "Crypto",
    y = "Volatility (SD of Daily Returns)",
    fill = "Symbol"
  ) +
  theme_minimal()

To combine and juxtapose (simple):

  • A dual Y axis cannot be used in this case
  • Reorder the factor before pivoting if needed
performance_metrics |> 
  mutate(symbol = fct_reorder(symbol, desc(avg_daily_ret))) |> 
  pivot_longer(cols = !symbol) |>  # make long form
  ggplot(
    aes(x = symbol, y = value, fill = name)
  ) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Average Daily Return / Volatility of Cryptos",
    subtitle = "Year 2020 - 2022",
    caption = "Data: Yahoo Finance",
    x = "Crypto",
    y = "Average Return / Volatility (%)",
    fill = "Metric"
  ) 

To combine and juxtapose (advanced):

  • Dual Y Axis technique
# Since return is smaller: scale by their max values
scale_factor <- max(performance_metrics$avg_daily_ret) / max(performance_metrics$vol_daily)

performance_metrics |> 
  ggplot(aes(x = fct_reorder(symbol, -avg_daily_ret))) +
  geom_col(
    aes(y = avg_daily_ret, fill = "Average Return"),
    position = position_nudge(x=-0.2), # move to left
    width = 0.4
  ) +
  geom_col(
    aes(y = vol_daily * scale_factor, fill = "Volatility"), # notice the scale factor
    position = position_nudge(x=0.2), # move to right
    width = 0.4) +
  scale_y_continuous(
    name = "Average Return (%)",
    labels = scales::percent_format(),
    sec.axis = sec_axis(
      \(x) x / scale_factor, 
      name = "Volatility (%)",
      labels = scales::percent_format())
  ) +
  labs(
    title = "Average Return and Volatility, Dual Axis",
    subtitle = "Year 2020 - 2022",
    caption = "Data: Yahoo Finance",
    x = "Crypto",
    fill = "Metric"
  )  +
  theme_bw()

Portfolio Sorting: Trading volume

A simple univariate portfolio sorting to see if “trading volume” predicts future returns.

  1. Calculate average trading volume and sort into quintile (5) groups
ranks <- crypto_data |> 
  summarize(avg_daily_volume = mean(volume, na.rm = TRUE), .by = symbol)
ranks <- ranks |> 
  mutate(
    volume_rank = ntile(avg_daily_volume, 5) # Ascending, Low 1 High 5
  )
head(ranks)
# A tibble: 6 × 3
  symbol   avg_daily_volume volume_rank
  <chr>               <dbl>       <int>
1 ADA-USD       1874739609.           3
2 BNB-USD       1554934937.           2
3 BTC-USD      36702325339.           5
4 DOGE-USD      1712255288.           3
5 DOT-USD       1392992622.           2
6 ETH-USD      18924519134.           4

Let’s test if volume explains future crypto returns:

  • Volume observed data from 2020-01-01 to 2023-01-01
  • 1 month future daily returns
# Get data
future_crypto <- tq_get(crypto_coins, from = "2023-01-02", to = "2023-02-01")
future_crypto <- future_crypto |> 
  arrange(symbol,date) |> 
  mutate(daily_ret = adjusted / lag(adjusted) - 1, .by = symbol)

Join rank (from past observation) to future crypto data using “symbol” as key

future_crypto <- future_crypto |> 
  left_join(ranks, by = join_by(symbol)) # Now ranks are joined
head(future_crypto)
# A tibble: 6 × 11
  symbol  date        open  high   low close    volume adjusted daily_ret
  <chr>   <date>     <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>     <dbl>
1 ADA-USD 2023-01-02 0.250 0.256 0.247 0.254 159328803    0.254  NA      
2 ADA-USD 2023-01-03 0.254 0.255 0.251 0.253 153555529    0.253  -0.00407
3 ADA-USD 2023-01-04 0.253 0.270 0.252 0.268 289945179    0.268   0.0589 
4 ADA-USD 2023-01-05 0.268 0.270 0.264 0.269 175511469    0.269   0.00532
5 ADA-USD 2023-01-06 0.269 0.279 0.268 0.279 326480796    0.279   0.0355 
6 ADA-USD 2023-01-07 0.279 0.280 0.273 0.277 166488086    0.277  -0.00555
# ℹ 2 more variables: avg_daily_volume <dbl>, volume_rank <int>

Generate average crypto daily return by volume rank:

portfolio_analysis <- future_crypto |> 
  summarize(avg_daily_ret_by_volume_sorting = mean(daily_ret, na.rm = TRUE), .by = volume_rank)
portfolio_analysis |> 
  arrange(volume_rank)
# A tibble: 5 × 2
  volume_rank avg_daily_ret_by_volume_sorting
        <int>                           <dbl>
1           1                         0.0228 
2           2                         0.0109 
3           3                         0.0126 
4           4                         0.00829
5           5                         0.0121 

Visualize (bar plot): basic plot

portfolio_analysis |> 
  ggplot(aes(x = volume_rank, y = avg_daily_ret_by_volume_sorting, fill = volume_rank)) +
  geom_col()

Problems:

  • volume_rank is considered as numeric
  • labels, themes, etc

Finalizing plot:

portfolio_analysis |> 
  mutate(volume_rank = as.factor(volume_rank)) |> 
  ggplot(aes(x = volume_rank, y = avg_daily_ret_by_volume_sorting, fill = volume_rank)) +
  geom_col() + 
  labs(
    title = "Average Daily Crypto return by Volume Sorting",
    subtitle = "January 2023 ",
    caption = "Quintile Volume Sorting from 2020-2022 Data",
    x = "Volume Rank (1 Low 5 High Volume)",
    y = "Average Daily Return (%)",
    fill = "Volume Rank"
  ) + 
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_brewer(palette = "Set1") +
  theme_bw()

Notes

This analysis is a demo. For more rigor, consider:

  • Expanding the sample (8 to 9 cryptos may not be enough to generalize)
  • Testing on different time frames

Factor Vectors

Factors

Factors represent categorical variables that contain a fixed and known set of possible values.

They’re useful when you want to display character vectors in a specific, non-alphabetical order.

Note

I introduce forcats::fct() instead of base R’s factor(), because fct() behaves more strictly and predictably.

Why Use Factors?

Factors solve two common problems with character vectors:

  1. Typos and invalid entries: Factors restrict inputs to predefined categories.

  2. Sorting: Factors can sort according to a custom order, rather than alphabetically.

Creating Factor Variables

Use the forcats::fct() function from the forcats package (part of the tidyverse):

library(tidyverse)

# The valid bond ratings
rating_levels <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC")

# Create a factor of bond ratings using valid levels
ratings <- fct(c("BBB", "AA", "A", "CCC"), levels = rating_levels)
ratings
[1] BBB AA  A   CCC
Levels: AAA AA A BBB BB B CCC

Sorting a factor follows the sequence of its defined levels, rather than alphabetical order:

sort(ratings)
[1] AA  A   BBB CCC
Levels: AAA AA A BBB BB B CCC

If a value is not among the levels, forcats::fct() raises an error:

ratings_invalid <- fct(c("BBB", "AA", "D", "A"), levels = rating_levels)
Error in `fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "D"
ratings_invalid
Error: object 'ratings_invalid' not found

If levels are not specified, fct() uses the order in which values first appear in the input:

country_factor <- fct(c("USA", "Canada", "South Korea"))
levels(country_factor)
[1] "USA"         "Canada"      "South Korea"

Tip

Base R’s factor() behaves differently: it sorts the levels alphabetically, which is often not the order you want.
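
To see the base R behavior, here is a minimal sketch using the same country vector. The object names base_factor and ordered_by_input are illustrative only.

```r
countries <- c("USA", "Canada", "South Korea")

# Base R: levels are sorted, regardless of input order
base_factor <- factor(countries)
levels(base_factor)

# To keep input order in base R, the levels must be passed explicitly
ordered_by_input <- factor(countries, levels = unique(countries))
levels(ordered_by_input)
```

The first call returns the levels in sorted order (Canada, South Korea, USA), while the second keeps the order of appearance.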

Accessing Factor Levels

The levels attribute is a very important part of factor objects.

attributes(ratings)
$levels
[1] "AAA" "AA"  "A"   "BBB" "BB"  "B"   "CCC"

$class
[1] "factor"

To view or modify the levels attribute, use levels(). If the levels change, the corresponding values are recoded.

levels(ratings)
[1] "AAA" "AA"  "A"   "BBB" "BB"  "B"   "CCC"
levels(ratings) <- 1:7
ratings
[1] 4 2 3 7
Levels: 1 2 3 4 5 6 7

AAA becomes 1, AA becomes 2, and so on, preserving the order.

Recoding Factor Levels

A convenient and safer way to recode is to use fct_recode()

rating_levels <- c(
  "AAA", "AA", "A", 
  "BBB", "BB", "B", "CCC") # possible values
bond_ratings <- fct(
  c("AAA","BBB", "A", "BB", "AA"), 
  levels = rating_levels)

bond_ratings <- fct_recode(bond_ratings,
  "Top Tier" = "AAA",
  "High Grade" = "AA",
  "Medium Grade" = "A",
  "Lower Grade" = "BBB",
  "Speculative" = "BB"
)
bond_ratings # B and CCC level remains
[1] Top Tier     Lower Grade  Medium Grade Speculative  High Grade  
Levels: Top Tier High Grade Medium Grade Lower Grade Speculative B CCC

Collapsing Factor Levels

Use fct_collapse() to combine multiple levels:

credit_ratings <- fct(c("AAA", "AA", "A", "BBB", "BB", "B"))

credit_ratings_collapsed <- fct_collapse(credit_ratings,
  "Investment" = c("AAA", "AA", "A"),
  "Speculative" = c("BBB", "BB", "B")
)

credit_ratings_collapsed
[1] Investment  Investment  Investment  Speculative Speculative Speculative
Levels: Investment Speculative

Reorder Factor level by hand

You can reorder levels by hand using fct_relevel(). Any levels not mentioned will be left in their existing order, after the explicitly mentioned levels.

stock_returns <- tibble(
  Ticker = fct(c("AAPL", "MSFT", "GOOG", "JPM", "BAC")),
  Sector = fct(c("Technology", "Technology", "Technology", "Financial", "Financial")),
  Return = c(0.12, 0.08, 0.10, 0.05, 0.04)
)

stock_returns |> 
  mutate(Reordered_Ticker = fct_relevel(Ticker, "GOOG")) |> 
  pull(Reordered_Ticker)
[1] AAPL MSFT GOOG JPM  BAC 
Levels: GOOG AAPL MSFT JPM BAC

Reorder Factor level by variable

Use fct_reorder(f, x) to reorder the levels of factor f according to x. It doesn’t change the positions of the actual values, only the order of the levels!

stock_returns |> 
  mutate(Ticker_reordered = fct_reorder(Ticker, Return)) |> 
  pull(Ticker_reordered)
[1] AAPL MSFT GOOG JPM  BAC 
Levels: BAC JPM MSFT GOOG AAPL
stock_returns |> 
  ggplot(aes(x = Return, y = fct_reorder(Ticker, Return))) +
  geom_col() +
  labs(title = "Stock Returns by Ticker", y = "Ticker", x = "Return")

Lump infrequent levels

fct_other() lumps every level except those listed in keep (or only those listed in drop) into an “Other” category.

ratings_full <- factor(
  c("AAA", "AA", "A", "BBB", 
  "BB", "B", "CCC", "AA", 
  "CCC", "B", "BBB", "D", 
  "E", "A", "BBB", "BBB"))

ratings_grouped <- fct_other(ratings_full, keep = c("AAA", "AA", "A", "BBB"))
ratings_grouped
 [1] AAA   AA    A     BBB   Other Other Other AA    Other Other BBB   Other
[13] Other A     BBB   BBB  
Levels: A AA AAA BBB Other

Lump infrequent levels by frequency

fct_lump() lumps infrequent levels into an “Other” category, based on n or prop.

fct_lump(ratings_full, n = 2) # keep the 2 most frequent levels; tied levels are kept too
 [1] Other AA    A     BBB   Other B     CCC   AA    CCC   B     BBB   Other
[13] Other A     BBB   BBB  
Levels: A AA B BBB CCC Other
fct_lump(ratings_full, prop = 0.1) # at least 10% proportion
 [1] Other AA    A     BBB   Other B     CCC   AA    CCC   B     BBB   Other
[13] Other A     BBB   BBB  
Levels: A AA B BBB CCC Other

Anonymize Levels

The fct_anon() function replaces the existing levels with anonymous (generic) labels.

ratings <- factor(c("AAA", "AA", "A", "BBB", "BB"))

# Replace factor levels with anonymous labels
anonymous_ratings <- fct_anon(ratings)
anonymous_ratings
[1] 1 2 4 3 5
Levels: 1 2 3 4 5

Exercises

Create a factor market_regime from the vector c("Bear", "Sideways", "Bull") such that the order is Bull, Sideways, then Bear.

Then, recode the levels to “Downturn”, “Flat”, and “Upturn”

Use fct_recode() to change the names.

Given the following tibble of regional sales data:

regional_sales <- tibble(
  region = c(
    "Northwest", "Southeast", 
    "Midwest", "Northeast"),
  avg_sales = c(120, 80, 150, 100)
)

Reorder the region factor based on avg_sales in ascending order and create a bar plot showing average sales by region.

Use fct_reorder(region, avg_sales) inside ggplot(aes())

You have a factor investment_style with the following values:

investment_style <- 
  fct(
    c("Growth", "Value", "Blend", 
    "Contrarian", "Speculative", 
    "Growth", "Value")
    )

Collapse the factor into two groups using fct_collapse()

Traditional: includes “Growth”, “Value”, “Blend”

Alternative: includes “Contrarian”, “Speculative”

Given a vector of currency codes:

currencies <- c(
  "USD", "EUR", "JPY", "USD", 
  "GBP", "AUD", "KRW", "EUR", 
  "USD", "USD", "EUR"
  )

Use fct_other() to lump together any currency other than “USD” as “Other” category.

Use fct_lump() to lump infrequent currency as “Other”, using n or prop

Logical Vectors

Logical vectors

They contain three possible values: TRUE, FALSE, and NA. Used extensively in data filtering, comparisons, and conditional transformations.

NA in other types

Although NA itself is logical, other atomic vectors (integer, double, character) can also contain missing values, so there is a corresponding NA for each type:

NA_integer_ for integer

NA_real_ for double

NA_character_ for character

R handles the type conversion automatically when needed, so you rarely need to use these typed NAs yourself.
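
A short check of the NA types and the automatic conversion, using only base R. The vector names mixed_num and mixed_chr are illustrative.

```r
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Automatic conversion: the logical NA takes on the type of the vector
mixed_num <- c(1.5, NA)
mixed_chr <- c("a", NA)
typeof(mixed_num)      # "double"
typeof(mixed_chr)      # "character"
```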

Review Logic

Q1. What is TRUE & FALSE?

Q2. What is TRUE | FALSE?

Q3. What is TRUE & TRUE?

Q4. What is FALSE | FALSE?

Q5. What is TRUE | TRUE?

Q6. What is FALSE & FALSE?

Missing Values: NA

NA represents missing data

Comparisons with NA return NA

Use is.na() to check for missing values.

NA > NA
[1] NA
NA == NA # NA returned
[1] NA
is.na(NA) # TRUE
[1] TRUE

NA and Logical Operators

Keep in mind the logic:

TRUE OR anything is TRUE

FALSE AND anything is FALSE

Guess the results:

NA == NA
NA & NA
NA | NA
TRUE & NA
TRUE | NA
TRUE == NA
FALSE & NA
FALSE == NA
FALSE | NA

Logical Values from Comparisons

x <- c(1, 2, 3, 4, 5)
x > 3  # Returns a vector of TRUE/FALSE values
[1] FALSE FALSE FALSE  TRUE  TRUE

Comparison operators: <, <=, >, >=, !=, ==

Modulo

The modulo operator (%%) is very useful for testing the divisibility of numbers

numbers <- 1:10

# Identify even numbers: A number is even if it leaves a remainder of 0 when divided by 2.
even_numbers <- numbers[numbers %% 2 == 0]
even_numbers # 2, 4, 6, 8, 10
[1]  2  4  6  8 10
# Identify odd numbers: A number is odd if it leaves a remainder of 1 when divided by 2.
odd_numbers <- numbers[numbers %% 2 == 1]
odd_numbers # 1, 3, 5, 7, 9
[1] 1 3 5 7 9

Boolean Algebra

New: Exclusive OR: xor

Combining Logical Vectors

x <- c(1, 2, 3, 4, 5)
(x > 3) & (x < 5)  # AND operator
[1] FALSE FALSE FALSE  TRUE FALSE
(x > 3) | (x < 2)  # OR operator
[1]  TRUE FALSE FALSE  TRUE  TRUE
!(x > 3)          # NOT operator
[1]  TRUE  TRUE  TRUE FALSE FALSE
xor((x > 3), (x < 5)) # Exclusive or
[1]  TRUE  TRUE  TRUE FALSE  TRUE

&: Element-wise AND

|: Element-wise OR

!: Negation

%in% Operator

Checks whether each element is found in another set.

x <- c("apple", "banana", "cherry")
x %in% c("apple", "grape") 
[1]  TRUE FALSE FALSE

Short-circuit operators

&& and || are short-circuit operators

  • Accept only scalar (length-1) inputs and return a scalar
  • Useful in programming (e.g., control flow)
  • Don’t use in dplyr functions!
TRUE && FALSE  # FALSE
c(TRUE, FALSE) && c(FALSE, TRUE)  # error since R 4.3.0: inputs must have length 1
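
Short-circuiting is what makes && handy for guarding checks in control flow: the right-hand side runs only if the left-hand side is TRUE. Below is a minimal sketch; is_positive_scalar() is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: is the input a single positive number?
is_positive_scalar <- function(x) {
  # Each check runs only if all earlier checks were TRUE,
  # so `x > 0` is never evaluated on NULL, NA, or a longer vector
  is.numeric(x) && length(x) == 1 && !is.na(x) && x > 0
}

is_positive_scalar(5)        # TRUE
is_positive_scalar(NULL)     # FALSE
is_positive_scalar(c(1, 2))  # FALSE
```

With the element-wise & instead, every condition would be evaluated, and the result for NULL or a longer vector would not be a single TRUE or FALSE.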

Floating point comparison

Checking equality of real (double) numbers with == is discouraged:

x <- c(1 / 49 * 49, sqrt(2) ^ 2)
print(x)
[1] 1 2

When checking with ==:

x == c(1, 2)
[1] FALSE FALSE

Why?

  • There’s no way to represent 1/49 or sqrt(2) exactly with a fixed number of digits
  • Computers store “close enough” approximations of real numbers
print(x, digits = 16)
[1] 0.9999999999999999 2.0000000000000004

That’s why == was failing.

To compare real numbers, use dplyr::near() function.

near(x, c(1,2)) # near() from dplyr package (included in tidyverse)
[1] TRUE TRUE
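
Conceptually, near() checks whether two numbers differ by less than a small tolerance. The base R sketch below mirrors that idea; it is an illustration rather than dplyr's actual source, and tol is set to near()'s documented default tolerance.

```r
x <- c(1 / 49 * 49, sqrt(2)^2)
target <- c(1, 2)

# Default tolerance used by dplyr::near()
tol <- .Machine$double.eps^0.5

# Compare the absolute difference against the tolerance
within_tol <- abs(x - target) < tol
within_tol
```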

Logical Summaries

There are two logical summaries: any() and all()

any(x) is the summary equivalent of |

  • TRUE if any element of x is TRUE, even when x contains NA

all(x) is the summary equivalent of &

  • TRUE if all elements of x are TRUE
  • FALSE if any element of x is FALSE, even when x contains NA

Check below:

x <- c(TRUE, FALSE, TRUE)
any(x)
all(x)

y <- c(FALSE, FALSE, FALSE)
any(y)
all(y)

Exercise

  1. Write a line of code to create a logical vector from the numbers 1 to 10 that tests whether each number is greater than 5.
    • Hint: Use the comparison operator > on the vector 1:10.
  2. Using the modulo operator, write code to extract the even numbers from the vector 1:20.
    • Hint: A number is even if it leaves a remainder of 0 when divided by 2 (i.e., use %%)
  3. Explain the result of TRUE & NA. What does this tell you about logical operations involving NA?

  4. Suppose you have a vector of company tickers:

tickers <- c("AAPL", "MSFT", "GOOG", "AMZN")

and portfolio:

portfolio <- c("AAPL", "TSLA", "GOOG")

Write code to determine which tickers in tickers are present in portfolio. Yield a logical vector.

  5. Suppose you have a vector of daily returns for a stock,
daily_returns <- c(0.01, NA, -0.005, 0.02, NA)
    1. Use any() to check if there is at least one positive return.
    2. Use all() to check if all daily returns are positive.
    3. Explain the outputs considering the presence of NA values.
    • Check documentation ?any() and ?all()

Numeric Vectors

Numeric Vectors

Numeric vectors are the backbone of financial data. Numerics include:

  • Integer
  • Double (real, or float)

We will use tidyverse verbs to manipulate numeric data in real-world finance examples.

Parse numbers

Sometimes numbers are stored as strings (characters), especially when data was imported from external sources.

  • parse_double() converts strings that are purely numeric

  • parse_number() extracts numeric parts from strings

parse_double() example:

# library(tidyverse)
price_str <- c("123.45", "67.89", "1e3")
typeof(price_str) # character

parsed_price <- parse_double(price_str)
print(parsed_price)
typeof(parsed_price) # double

parse_number() example:

# Bond yields recorded with extra text and symbols
yield_str <- c("5.25 percent", "Yield 4.75", "It is 6.00%")
parsed_yields <- parse_number(yield_str)
print(parsed_yields)
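
parse_number() also copes with currency symbols and thousands separators, which are common in imported price data. A small sketch follows; the amount_str values are made up for illustration.

```r
library(readr)

# Hypothetical amounts with currency symbols and grouping marks
amount_str <- c("$1,234.56", "USD 2,000", "12.5%")
parsed_amounts <- parse_number(amount_str)
parsed_amounts
```

Note that "12.5%" is parsed as 12.5, not 0.125: parse_number() only extracts the number and does not rescale percentages.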

pmin, pmax

pmin() and pmax() compare values element-wise (row-wise in a tibble) and return the smaller or larger value at each position.

bond_yields1 <- c(3.5, 3.6, 3.7)
bond_yields2 <- c(3.8, 3.4, 3.9)
yields <- tibble(bond_yields1, bond_yields2)
yields |> 
  mutate(
    min_yield = pmin(bond_yields1, bond_yields2), 
    max_yield = pmax(bond_yields1, bond_yields2))
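
The difference from plain min() is easy to miss, so here is a short base R comparison on the same two vectors. The names overall_min and pairwise_min are illustrative.

```r
bond_yields1 <- c(3.5, 3.6, 3.7)
bond_yields2 <- c(3.8, 3.4, 3.9)

# min() collapses everything into one number
overall_min <- min(bond_yields1, bond_yields2)

# pmin() returns one number per position
pairwise_min <- pmin(bond_yields1, bond_yields2)

overall_min   # 3.4
pairwise_min  # 3.5 3.4 3.7
```

Inside mutate(), min() would therefore repeat the same single value in every row, which is rarely what is intended.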

Modular Arithmetic

Modular arithmetic is useful for breaking down composite numbers.

  • %/% integer division (quotient)
  • %% modulo operator (remainder)

For example, convert a time value in HHMM format to hours and minutes:

# Suppose bond market closes at 1559 (3:59 PM)
close_time <- 1559
hour <- close_time %/% 100 # quotient
minute <- close_time %% 100 # remainder
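
Once the parts are separated, they can be recombined in other forms. The sketch below continues the same example; minutes_since_midnight and time_label are names introduced here for illustration.

```r
close_time <- 1559
hour <- close_time %/% 100   # 15
minute <- close_time %% 100  # 59

# Minutes elapsed since midnight
minutes_since_midnight <- hour * 60 + minute

# Rebuild a clock-style label with zero padding
time_label <- sprintf("%02d:%02d", as.integer(hour), as.integer(minute))

minutes_since_midnight  # 959
time_label              # "15:59"
```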

Logarithms

In finance, logarithmic returns are often used. In R, log() computes the natural log, while log2() and log10() use base 2 and base 10.

Logarithmic (Log) Returns

Calculated as the natural logarithm of the ratio of consecutive prices:

\(r_{log} = \ln\left(\frac{P_t}{P_{t-1}}\right)\)

Log returns are additive over time, which makes cumulative calculations more straightforward.

Key Differences between arithmetic and logarithmic returns:

Additivity

Log returns can be summed to get the cumulative log return, whereas arithmetic returns must be compounded

Approximation

For small returns, log returns are very similar to arithmetic returns, but the difference becomes significant for larger returns
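
Both properties can be written out explicitly. The notation here is an assumption added for illustration: \(r_t\) denotes the arithmetic return in period \(t\), and \(P_0\) the starting price. Summing log returns telescopes to the log of the total price ratio:

\(\sum_{t=1}^{T} \ln\left(\frac{P_t}{P_{t-1}}\right) = \ln\left(\frac{P_T}{P_0}\right)\)

Arithmetic returns reach the same total only through compounding:

\(\prod_{t=1}^{T} (1 + r_t) - 1 = \frac{P_T}{P_0} - 1\)

The approximation follows from \(\ln(1 + r) \approx r\) for small \(r\). For example, \(\ln(1.01) \approx 0.00995\), close to 1%, while \(\ln(1.5) \approx 0.405\), noticeably below 50%.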

Returns comparison

Let’s see how to compute cumulative returns using both methods.

prices <- c(100, 102, 101, 105, 107)
results <- tibble(
  Day = 1:length(prices),
  Price = prices
)
results <- results |> 
  mutate(
    arith_ret = Price / lag(Price) - 1,
    log_ret = log(Price / lag(Price)),
    log_ret_convert = exp(log_ret) - 1,
    ) |> 
  drop_na()
print(results)

Cumulative Returns

results <- results |> mutate(
    cumprod_ari_ret = cumprod(1 + arith_ret) - 1,
    cumsum_log_ret = cumsum(log_ret),
    cumsum_log_ret_convert = exp(cumsum_log_ret) -1
  )
print(results)
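
As a sanity check, both cumulative calculations should agree with the total return computed directly from the first and last prices. Here is a base R sketch of that check; the variable names are illustrative.

```r
prices <- c(100, 102, 101, 105, 107)

arith_ret <- prices[-1] / prices[-length(prices)] - 1
log_ret <- diff(log(prices))

# Total return straight from the first and last price
total_from_prices <- prices[length(prices)] / prices[1] - 1

# Compounded arithmetic returns and summed log returns
total_from_arith <- prod(1 + arith_ret) - 1
total_from_log <- exp(sum(log_ret)) - 1

c(total_from_prices, total_from_arith, total_from_log)  # all 0.07
```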

Rounding

Rounding is key for reporting. Use round(), floor(), and ceiling().

price <- 123.456
floor(price)         # 123
floor(price * 100) / 100 # round down to the 2nd decimal
ceiling(price)       # 124
round(price, 2)      # 123.46 (nearest number)
round(price, -1)     # 120 (to the nearest ten)

Cuts

cut() bins numeric values into discrete intervals with custom breaks.

div_yields <- c(0.02, 0.03, 0.05, 0.07, 0.1)
yield_bins <- cut(div_yields, breaks = c(0, 0.03, 0.06, 0.1))
yield_bins <- cut(div_yields, breaks = c(0, 0.03, 0.06, 0.1), labels = c("Low", "Medium", "High"))
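
A detail worth knowing is which side of each interval is closed. By default cut() uses right-closed intervals, so a value sitting exactly on a break falls into the lower bin. The sketch below contrasts this with right = FALSE; the object names are illustrative.

```r
div_yields <- c(0.02, 0.03, 0.05, 0.07, 0.1)
breaks <- c(0, 0.03, 0.06, 0.1)
bin_labels <- c("Low", "Medium", "High")

# Default: right-closed intervals, so 0.03 falls in (0, 0.03]
right_closed <- cut(div_yields, breaks = breaks, labels = bin_labels)

# right = FALSE: left-closed intervals, so 0.03 falls in [0.03, 0.06);
# include.lowest = TRUE keeps the top break (0.1) inside the last bin
left_closed <- cut(div_yields, breaks = breaks, labels = bin_labels,
                   right = FALSE, include.lowest = TRUE)

table(right_closed)
table(left_closed)
```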

Offsets

dplyr::lag() and dplyr::lead() let you refer to the values just before or after the current position.

prices <- c(150, 155, 160, 158)
lag(prices) # previous value
lag(prices, 2) # value two positions back
lead(prices) # next value

Positions

Extract positions: first(), last(), nth()

prices <- c(150, 155, 160, 158)
first(prices) 
last(prices) 
nth(prices,3) 

Exercise

For the price:

price <- 987.654

Round price to

  1. The nearest whole number
  2. Two decimal places
  3. Nearest ten

Parse below character vector of prices properly:

price_str <- c("120.50", "99.99", "1e2")
price_str2 <- c("The price is $152", "aiming $199 dollars per stock", "reached $358 dollars")

Date/Time Vectors

Dates and Time in Finance

In finance, tracking dates and times is critical for modeling transactions, trade dates, settlement dates, and market events.

Although dates and times seem straightforward, they involve complexities such as:

  • leap years
  • time zones
  • daylight saving time

Create date and time

There are three types:

  • A date, tibble prints as <date>
  • A datetime, tibble prints as <dttm>, also referred to as “POSIXct”
  • A time, tibble prints as <time> from hms

Base R doesn’t have a native class for time of day, but the tidyverse (hms) offers one.

Simple Date and Time Vectors

today() and now() create date and datetime class vectors, respectively.

library(tidyverse)

# Current date and time
current_date <- today()       
current_datetime <- now()      
class(current_date)
class(current_datetime)

Parse Date when Import

If external data has standard (i.e., ISO 8601) dates and datetimes, read_csv() will parse them automatically.

csv <- "
  date, datetime
  2022-01-02,2022-01-02 05:12
"
read_csv(csv)

Parse Date with Manual Formatters

If external data has an ambiguous format, you can manually specify the format to handle.

# Which is day, month, year?
csv <- "
  date
  01/02/15
"
read_csv(csv, col_types = cols(date = col_date("%m/%d/%y")))
read_csv(csv, col_types = cols(date = col_date("%d/%m/%y")))
read_csv(csv, col_types = cols(date = col_date("%y/%m/%d")))

Date/Time formatters

Type   Code  Meaning            Example
Year   %Y    4 digit year       2021
       %y    2 digit year       21
Month  %m    Number             2
       %b    Abbreviated name   Feb
       %B    Full name          February
Day    %d    One or two digits  2
Time   %H    24-hour hour       13
       %M    Minutes            35
       %S    Seconds            45
       %I    12-hour hour       1
       %p    AM/PM              pm
       %Z    Time zone name     America/Chicago
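
The same codes are used both for parsing and for printing. The sketch below uses base R's as.Date(), as.POSIXct() and format(), which accept these codes; the date values and the names settle_date and trade_time are made up for illustration.

```r
# Parse a non-standard date string by describing its layout
settle_date <- as.Date("2024.03.15", format = "%Y.%m.%d")

# Print the same date in other layouts
format(settle_date, "%d/%m/%Y")  # "15/03/2024"
format(settle_date, "%y%m%d")    # "240315"

# Datetime with 24-hour clock codes
trade_time <- as.POSIXct("2024-03-15 13:35:45",
                         format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(trade_time, "%H:%M")      # "13:35"
```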

Exercise: Guess the Format!

Guess the correct Date/Time Format:

  1. “2021-07-25”
  2. “Jan, 1, 2011”
  3. “07/25/21”
  4. “2021-07-25 14:35:45”
  5. “07/25/2021 02:35 PM”
  6. “2021-07-25 14:35:45 EST”
  7. “25 July 2021”

Parse Strings to Date

Some strings are not handled well by the format codes alone, such as:

  • “May 1st, 2023”
  • “May 23rd, 2023”

The lubridate package has convenient parsers for these cases.

date_str1 <- "Dec 25th, 2017"
date_str2 <- "2020-05-18"
date_str3 <- "5 October, 2023"
date_str4 <- "09/25/1986"
date_str5 <- "1988-9-4"

mdy(date_str1)
ymd(date_str2)
dmy(date_str3)
mdy(date_str4)
ymd(date_str5)

Parse Strings to Datetime

The lubridate package has convenient parsers for datetimes as well. The time zone must be specified correctly.

trade_datetime_24 <- "2023-05-15 09:30:00"
trade_datetime_12 <- "May 15, 2023 09:30 AM"
trade_datetime_24_tz <- "2023-05-15 09:30:00 EST"

ymd_hms(trade_datetime_24) # UTC by default
ymd_hms(trade_datetime_24, tz = "EST") # set time zone at EST
mdy_hm(trade_datetime_12)
ymd_hms(trade_datetime_24_tz) # the time zone should still be specified with tz

Time zones

Time zones are not just formatting. They change the underlying value, especially when a datetime is parsed from a string.

trade_datetime_24 <- "2023-05-15 09:30:00"
utc <- ymd_hms(trade_datetime_24, tz = "UTC")
est <- ymd_hms(trade_datetime_24, tz = "EST")

unclass(utc) # 1684143000
unclass(est) # 1684161000

Robust Time Zones

If you’re American you’ll know “EST” as Eastern Standard Time, but both Australia and Canada also have an EST!

R uses the international standard IANA time zone names, written as {area}/{location}.

# To browse all time zone names, use OlsonNames()
OlsonNames()

Changing Time Zones

There are two scenarios in which you want to change time zones:

  1. Keep the instant but change the time formatting
    • Like converting time using a world clock
  2. Keep the time formatting but change the instant
    • Usually to fix a data error

Keep the instant, change the formatting

with_tz() keeps the instant but changes the time zone used to display it.

  • As you would see on a world clock!
datetime <- ymd_hms("2023-01-01 12:00:00", tz = "America/New_York")
print(datetime)

with_tz(datetime, tzone = "America/Los_Angeles")

Exercise

For the following datetime, change the time zone to Chicago while keeping the instant.

market_close <- ymd_hms("2023-01-01 16:00:00", tz = "America/New_York")

Keep the formatting, change the instant

force_tz() keeps the displayed clock time but changes the instant.

  • To fix a data error
market_close <- ymd_hms("2023-01-01 16:00:00", tz = "America/Chicago") 
print(market_close)

force_tz(market_close, tzone = "America/New_York")
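
The two functions are easiest to tell apart by looking at the underlying number of seconds since 1970-01-01 UTC. The sketch below uses a made-up datetime and Europe/London as the target zone; ny_noon, same_instant and same_clock are illustrative names.

```r
library(lubridate)

ny_noon <- ymd_hms("2023-01-01 12:00:00", tz = "America/New_York")

same_instant <- with_tz(ny_noon, tzone = "Europe/London")   # clock reads 17:00
same_clock   <- force_tz(ny_noon, tzone = "Europe/London")  # clock reads 12:00

# Underlying seconds: unchanged by with_tz(), shifted by force_tz()
as.numeric(same_instant) - as.numeric(ny_noon)  # 0
as.numeric(same_clock) - as.numeric(ny_noon)    # -18000 (5 hours earlier)
```

Noon in London happens five hours before noon in New York on that date, which is why force_tz() moves the instant back by 18000 seconds.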

Exercise

For the following datetime, change the time zone to Chicago while keeping the displayed clock time.

market_close <- ymd_hms("2023-01-01 15:00:00", tz = "America/New_York")

Date/Time Components

You can pull out individual parts of the date with the accessor functions.

datetime <- ymd_hms("2026-07-08 12:34:56")
class(datetime) # POSIXct

year(datetime)
month(datetime)
day(datetime)
mday(datetime) # same as day()
yday(datetime)
wday(datetime)
wday(datetime, label = TRUE)
hour(datetime)
minute(datetime)
second(datetime)

Exercise

example_datetime <- now()

What is the

  • weekday
  • month day
  • yearday
  • month
  • second

of example_datetime?

Rounding Dates

In finance, dates and times are often floored to match the data frequency and to pick up the most relevant information as of a specific time.

floor_date(), ceiling_date() and round_date()

last_traded_time <- ymd_hms("2024-09-08 13:33:45.653 EST", tz = "EST") # milliseconds
floor_date(last_traded_time) # by default, second
ceiling_date(last_traded_time)
floor_date(last_traded_time, unit = "10 seconds")
floor_date(last_traded_time, unit = "15 mins")
floor_date(last_traded_time, unit = "2 hours")

Exercise

  1. Convert the string into a datetime object. Use time zone: America/New_York
last_traded_time <- "2024-09-08 13:33:45.653 EST"
  2. Round the above trading time down by “1 day”, “10 hours”, “5 minutes”, “10 seconds”

Missing Values

Missing Values

Missing values frequently appear in financial datasets.

Two types of missingness:

  • Explicit missing: values marked NA

    • Presence of absence
  • Implicit missing: rows that should exist but are absent

    • Absence of presence

Example

financial_reports <- tibble(
  company = c("AAPL", "AAPL", "AAPL", "AAPL", "TSLA", "TSLA"),
  year = c(2020, 2020, 2020, 2021, 2021, 2021),
  quarter = c(1, 2, 3, 1, 1, 2),
  revenue = c(100, NA, 110, 200, 210, 220)
)
financial_reports |> 
  gt::gt()
company  year  quarter  revenue
AAPL     2020        1      100
AAPL     2020        2       NA
AAPL     2020        3      110
AAPL     2021        1      200
TSLA     2021        1      210
TSLA     2021        2      220
  • Explicit missing:

    • NA values in the revenue column
  • Implicit missing:

    • AAPL: Q4 missing in 2020, …
    • TSLA: all of 2020 missing, and Q3, Q4 missing in 2021

Implicit Missing Values

Generally, you want to reveal those implicit missing cases as explicit. tidyr::complete() is handy for this operation.

Note

tidyr is included in tidyverse.

# Provide set of variables of which combination should exist
financial_reports |> 
  complete(company, year, quarter) 
# A tibble: 12 × 4
   company  year quarter revenue
   <chr>   <dbl>   <dbl>   <dbl>
 1 AAPL     2020       1     100
 2 AAPL     2020       2      NA
 3 AAPL     2020       3     110
 4 AAPL     2021       1     200
 5 AAPL     2021       2      NA
 6 AAPL     2021       3      NA
 7 TSLA     2020       1      NA
 8 TSLA     2020       2      NA
 9 TSLA     2020       3      NA
10 TSLA     2021       1     210
11 TSLA     2021       2     220
12 TSLA     2021       3      NA

Since Q4 is missing for every company and year, complete() cannot make every missing value explicit from the data alone.

In this case, you can provide the full set of values yourself.
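
One way to do this is to pass the full set of quarters to complete(). The sketch below is an assumption about how the 16-row financial_reports_ex table used in the following examples could be built; the original construction is not shown here.

```r
library(tibble)
library(tidyr)

financial_reports <- tibble(
  company = c("AAPL", "AAPL", "AAPL", "AAPL", "TSLA", "TSLA"),
  year = c(2020, 2020, 2020, 2021, 2021, 2021),
  quarter = c(1, 2, 3, 1, 1, 2),
  revenue = c(100, NA, 110, 200, 210, 220)
)

# Assumption: financial_reports_ex is the table expanded to all 4 quarters
financial_reports_ex <- financial_reports |>
  complete(company, year, quarter = c(1, 2, 3, 4))

nrow(financial_reports_ex)  # 2 companies x 2 years x 4 quarters = 16
```

Supplying quarter = c(1, 2, 3, 4) tells complete() which values should exist, even though Q4 never appears in the data.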

Explicit Missing Values

There are roughly 3 methods to handle missing values in Finance:

  1. Filling with designated value
  • e.g., Replace NA to 0
  1. Last observation carried forward
  • e.g., Use most recent past observation to fill NA
  • c.f., Next observation carried backward
  1. Linear Interpolation
  • e.g., fill NA with incremental values in between

Filling with designated value

A heuristic approach for when you simply know (or assume) what the NA values should be. ifelse() is a useful tool.

financial_reports_ex |> 
    mutate(revenue_filled = ifelse(is.na(revenue), 100, revenue))
# A tibble: 16 × 5
   company  year quarter revenue revenue_filled
   <chr>   <dbl>   <dbl>   <dbl>          <dbl>
 1 AAPL     2020       1     100            100
 2 AAPL     2020       2      NA            100
 3 AAPL     2020       3     110            110
 4 AAPL     2020       4      NA            100
 5 AAPL     2021       1     200            200
 6 AAPL     2021       2      NA            100
 7 AAPL     2021       3      NA            100
 8 AAPL     2021       4      NA            100
 9 TSLA     2020       1      NA            100
10 TSLA     2020       2      NA            100
11 TSLA     2020       3      NA            100
12 TSLA     2020       4      NA            100
13 TSLA     2021       1     210            210
14 TSLA     2021       2     220            220
15 TSLA     2021       3      NA            100
16 TSLA     2021       4      NA            100

Or you can use the tidyr::replace_na() function.

financial_reports_ex |> 
    replace_na(list(revenue = 100))
# A tibble: 16 × 4
   company  year quarter revenue
   <chr>   <dbl>   <dbl>   <dbl>
 1 AAPL     2020       1     100
 2 AAPL     2020       2     100
 3 AAPL     2020       3     110
 4 AAPL     2020       4     100
 5 AAPL     2021       1     200
 6 AAPL     2021       2     100
 7 AAPL     2021       3     100
 8 AAPL     2021       4     100
 9 TSLA     2020       1     100
10 TSLA     2020       2     100
11 TSLA     2020       3     100
12 TSLA     2020       4     100
13 TSLA     2021       1     210
14 TSLA     2021       2     220
15 TSLA     2021       3     100
16 TSLA     2021       4     100

Last observation carried forward

tidyr::fill() offers convenient filling options. Columns are chosen the same way as in select(), and the .direction argument controls how values are filled:

- "down": fill downwards (LOCF)
- "up": fill upwards (NOCB)
- "downup": LOCF then NOCB
- "updown": NOCB then LOCF

When filling in a direction, grouping and arranging are important.

# Wrong case: using Apple's revenue to fill Tesla
financial_reports_ex |> 
    fill(revenue, .direction = "down")

If you use LOCF, below is the correct approach:

financial_reports_ex |> 
    arrange(company, year, quarter) |> 
    group_by(company) |> 
    fill(revenue, .direction = "down")
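To see why grouping matters, here is a small sketch on hypothetical toy data (the toy tibble is made up for illustration). Without group_by(), the missing TSLA value is filled from the AAPL row above it.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: TSLA's first quarter is missing
toy <- tibble(
  company = c("AAPL", "AAPL", "TSLA", "TSLA"),
  quarter = c(1, 2, 1, 2),
  revenue = c(100, 110, NA, 220)
)

# Ungrouped: TSLA Q1 wrongly inherits AAPL's last value
ungrouped <- toy |>
  fill(revenue, .direction = "down")

# Grouped: LOCF cannot cross companies, so TSLA Q1 stays NA
grouped <- toy |>
  arrange(company, quarter) |>
  group_by(company) |>
  fill(revenue, .direction = "down") |>
  ungroup()

ungrouped$revenue
grouped$revenue
```

In the grouped version TSLA Q1 stays NA because no earlier TSLA observation exists; "downup" would be one way to fill it from the next observation instead.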

Linear Interpolation

Interpolation estimates a value between two known points. The approx() function is a handy tool for this.

By default, it returns estimates at 50 equally spaced points across the range of x (here, positions 1 to 5 of the vector).

daily_return <- c(0.01, NA, -0.01, NA, 0.05) # considered as y axis
length(daily_return) # length 5
[1] 5
approx(daily_return) 
$x
 [1] 1.000000 1.081633 1.163265 1.244898 1.326531 1.408163 1.489796 1.571429
 [9] 1.653061 1.734694 1.816327 1.897959 1.979592 2.061224 2.142857 2.224490
[17] 2.306122 2.387755 2.469388 2.551020 2.632653 2.714286 2.795918 2.877551
[25] 2.959184 3.040816 3.122449 3.204082 3.285714 3.367347 3.448980 3.530612
[33] 3.612245 3.693878 3.775510 3.857143 3.938776 4.020408 4.102041 4.183673
[41] 4.265306 4.346939 4.428571 4.510204 4.591837 4.673469 4.755102 4.836735
[49] 4.918367 5.000000

$y
 [1]  0.0100000000  0.0091836735  0.0083673469  0.0075510204  0.0067346939
 [6]  0.0059183673  0.0051020408  0.0042857143  0.0034693878  0.0026530612
[11]  0.0018367347  0.0010204082  0.0002040816 -0.0006122449 -0.0014285714
[16] -0.0022448980 -0.0030612245 -0.0038775510 -0.0046938776 -0.0055102041
[21] -0.0063265306 -0.0071428571 -0.0079591837 -0.0087755102 -0.0095918367
[26] -0.0087755102 -0.0063265306 -0.0038775510 -0.0014285714  0.0010204082
[31]  0.0034693878  0.0059183673  0.0083673469  0.0108163265  0.0132653061
[36]  0.0157142857  0.0181632653  0.0206122449  0.0230612245  0.0255102041
[41]  0.0279591837  0.0304081633  0.0328571429  0.0353061224  0.0377551020
[46]  0.0402040816  0.0426530612  0.0451020408  0.0475510204  0.0500000000

You can request estimates at specific points with the xout argument. Notice that the output is a list with elements x and y.

approx(daily_return, xout = 1:5)
$x
[1] 1 2 3 4 5

$y
[1]  0.01  0.00 -0.01  0.02  0.05

To pull the interpolated results, access y from the result.

approx(daily_return, xout = 1:5)$y
[1]  0.01  0.00 -0.01  0.02  0.05

It is easy to visualize the results from linear approximation:

approx(daily_return, xout = 1:5) |> 
    as_tibble() |> 
    ggplot(aes(x = x, y = y)) +
    geom_point() + 
    geom_line() + 
    theme_bw()

If the observations are unevenly spaced, supply the x-axis values so that the slope is calculated correctly:

day <- c(1, 2, 16, 20, 26) # Suppose the returns are from days 1, 2, 16, 20, 26
approx(x = day, y = daily_return, xout = day) |> 
    as_tibble() |> 
    ggplot(aes(x = x, y = y)) +
    geom_point() +
    geom_line() +
    theme_bw()

Treasury Yield Interpolation

For example, fill in a straight-line estimate for the “9 month” yield.

The treasury daily yield data looks like below.

treasury_data <- tibble(
  date = as.Date(c("2025-04-01", "2025-04-02", "2025-04-03", "2025-04-04")),
  x6_mo = c(4.23, 4.24, 4.20, 4.14),
  x1_yr = c(4.01, 4.04, 3.92, 3.86),
  x2_yr = c(3.87, 3.91, 3.71, 3.68)
)

print(treasury_data)
# A tibble: 4 × 4
  date       x6_mo x1_yr x2_yr
  <date>     <dbl> <dbl> <dbl>
1 2025-04-01  4.23  4.01  3.87
2 2025-04-02  4.24  4.04  3.91
3 2025-04-03  4.2   3.92  3.71
4 2025-04-04  4.14  3.86  3.68

To interpolate, you’ll need to pivot the data and make an explicit missing value:

treasury_interpolate <- treasury_data |> 
    pivot_longer(cols = !date) |> 
    complete(date, name = c("x6_mo", "x9_mo","x1_yr","x2_yr")) |> 
    mutate(name = fct(name, levels = c("x6_mo","x9_mo","x1_yr","x2_yr"))) |> # ordering
    arrange(date, name)
treasury_interpolate
# A tibble: 16 × 3
   date       name  value
   <date>     <fct> <dbl>
 1 2025-04-01 x6_mo  4.23
 2 2025-04-01 x9_mo NA   
 3 2025-04-01 x1_yr  4.01
 4 2025-04-01 x2_yr  3.87
 5 2025-04-02 x6_mo  4.24
 6 2025-04-02 x9_mo NA   
 7 2025-04-02 x1_yr  4.04
 8 2025-04-02 x2_yr  3.91
 9 2025-04-03 x6_mo  4.2 
10 2025-04-03 x9_mo NA   
11 2025-04-03 x1_yr  3.92
12 2025-04-03 x2_yr  3.71
13 2025-04-04 x6_mo  4.14
14 2025-04-04 x9_mo NA   
15 2025-04-04 x1_yr  3.86
16 2025-04-04 x2_yr  3.68

Then, generate a numeric column to help interpolate the yield estimates.

treasury_interpolate <- treasury_interpolate |> 
    mutate(
        days = case_when(
            name == "x6_mo" ~ 180,
            name == "x9_mo" ~ 270,
            name == "x1_yr" ~ 360,
            name == "x2_yr" ~ 720,
        )
    )
head(treasury_interpolate)
# A tibble: 6 × 4
  date       name  value  days
  <date>     <fct> <dbl> <dbl>
1 2025-04-01 x6_mo  4.23   180
2 2025-04-01 x9_mo NA      270
3 2025-04-01 x1_yr  4.01   360
4 2025-04-01 x2_yr  3.87   720
5 2025-04-02 x6_mo  4.24   180
6 2025-04-02 x9_mo NA      270

Finally, interpolate with the approx() function. Notice the use of group_by(): each date is interpolated separately.

treasury_interpolate |> 
    group_by(date) |> 
    mutate(value_interpolated = approx(x = days, y = value, xout = days)$y)
# A tibble: 16 × 5
# Groups:   date [4]
   date       name  value  days value_interpolated
   <date>     <fct> <dbl> <dbl>              <dbl>
 1 2025-04-01 x6_mo  4.23   180               4.23
 2 2025-04-01 x9_mo NA      270               4.12
 3 2025-04-01 x1_yr  4.01   360               4.01
 4 2025-04-01 x2_yr  3.87   720               3.87
 5 2025-04-02 x6_mo  4.24   180               4.24
 6 2025-04-02 x9_mo NA      270               4.14
 7 2025-04-02 x1_yr  4.04   360               4.04
 8 2025-04-02 x2_yr  3.91   720               3.91
 9 2025-04-03 x6_mo  4.2    180               4.2 
10 2025-04-03 x9_mo NA      270               4.06
11 2025-04-03 x1_yr  3.92   360               3.92
12 2025-04-03 x2_yr  3.71   720               3.71
13 2025-04-04 x6_mo  4.14   180               4.14
14 2025-04-04 x9_mo NA      270               4   
15 2025-04-04 x1_yr  3.86   360               3.86
16 2025-04-04 x2_yr  3.68   720               3.68
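As a sanity check, the 9-month estimate for 2025-04-01 can be reproduced by hand with the straight-line formula. This is a small sketch using only base R; the variable names are chosen here for illustration.

```r
# Known points on 2025-04-01: (180 days, 4.23) and (360 days, 4.01)
x0 <- 180; y0 <- 4.23
x1 <- 360; y1 <- 4.01
x  <- 270 # the 9-month point to estimate

# Straight-line formula: start value plus slope times distance from x0
manual <- y0 + (y1 - y0) / (x1 - x0) * (x - x0)

# approx() from base R should agree with the manual calculation
via_approx <- approx(x = c(x0, x1), y = c(y0, y1), xout = x)$y

manual
all.equal(manual, via_approx)
```

Because 270 days sits exactly halfway between 180 and 360, the estimate is simply the midpoint of the two yields.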

Exercise

stock_returns <- tibble(
  date = as.Date(c("2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05")),
  return = c(0.01, NA, -0.01, NA, 0.02)
)
  1. Use ifelse() to create a new column return_filled where missing returns are filled with 0, assuming no change in stock price on those days.
  2. Use tidyr::replace_na() to achieve the same result, replacing NA values with 0 in the return column.
tesla_revenue <- tibble(
  company = "TSLA",
  year = c(2020, 2020, 2021),
  quarter = c(1, 3, 2),
  revenue = c(100, 110, 120)
)
  3. Use tidyr::complete() to add the missing quarters for 2020 and 2021. Assume that each year should have quarters 1 to 4 (Q1, Q2, Q3, Q4). The missing revenue values should appear as NA.

  4. Fill the missing revenue values using the LOCF method. Ensure the data is properly arranged by year and quarter before applying tidyr::fill().

treasury_yields <- tibble(
  maturity = c("6 Mo", "9 Mo", "1 Yr", "2 Yr"),
  days = c(180, 270, 360, 720),
  yield = c(4.23, NA, 4.01, 3.87)
)
  5. Use the approx() function to interpolate the yield for “9 Mo” based on the days and yield columns. Provide the interpolated yield value as your answer.

Character Vectors

Strings

Characters (Strings) store text information in finance such as

  • earnings announcements
  • analyst opinions
  • descriptions
  • investment sentiments, etc.

We’ll mostly use the stringr package (included in tidyverse).

Generate Strings

You can create strings by wrapping values in single quotes (') or double quotes (").

announcement1 <- 'Tariff will be 34%'
announcement2 <- "Actually, it will be 125%!"
quote_in_quotes <- 'If you want to include "quote" inside, mix \' and \"'

Escape Strings

Special characters (quotes, backslashes, etc.) have reserved uses; to include them literally in a string, escape them with a backslash \.

double_quotes <- c("\"", '"')
print(double_quotes) 
cat(double_quotes)
str_view(double_quotes)

backslashes <- c("\\", '\\')
print(backslashes)
cat(backslashes)
str_view(backslashes)
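A short base-R sketch of what escaping does to the stored string: each escape sequence is a single character, and only the printed representation shows the extra backslash. The path value is a hypothetical example.

```r
# Each escape sequence is stored as ONE character
n_quote     <- nchar("\"")  # a double quote
n_backslash <- nchar("\\")  # a single backslash

# print() shows the escaped form; writeLines() shows the actual contents
path <- "C:\\data\\prices.csv" # hypothetical Windows-style path
print(path)
writeLines(path)

n_path <- nchar(path) # counts each backslash once
```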

Other Special Characters

There are some other special characters worth remembering:

  • \n newline
  • \t tab
  • \U Unicode escapes
x <- c("one\ntwo", "one\ttwo", "\U0001f604")
print(x) # print gives you the structure
cat(x)
str_view(x)

Tricky Escapes

Creating a string with multiple quotes and backslashes gets confusing quickly! For example:

tricky <- "
Without raw strings,
Double backslashes \\\\ and 
double double quotes '\"\"' with quotes
will make you crazy."
cat(tricky)

Without raw strings,
Double backslashes \\ and 
double double quotes '""' with quotes
will make you crazy.

This is called Leaning Toothpick Syndrome.

Raw Strings

To avoid escaping, you can use a raw string, written as r"(...)", r"{...}", or r"[...]".

not_so_tricky <- r"(
With raw strings,
writing double backslashes \\ and
double double quotes '""' wrapped with quotes
will not be so hard.
)"
cat(not_so_tricky)

With raw strings,
writing double backslashes \\ and
double double quotes '""' wrapped with quotes
will not be so hard.
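A quick check that a raw string and its escaped equivalent are the same value. This sketch assumes R 4.0 or later (where raw strings are available); the path is hypothetical.

```r
# The same hypothetical path, typed two ways
escaped <- "C:\\reports\\q1.csv"
raw     <- r"(C:\reports\q1.csv)"

same <- identical(escaped, raw) # same string, different ways of typing it
same
```

Raw strings change only how you type the string, not what is stored.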

Exercise

Create strings that contain the following values:

  • He said “We have a beautiful announcement today!”
  • \a\b\c\d
  • \\\

Creating strings

str_c() concatenates multiple string vectors, element-wise.

str_c("A", "B") # scalar
[1] "AB"
str_c(c("a", "b"), c("c","d")) # vectorized
[1] "ac" "bd"

For example, combine a financial report header with company names:

str_c("Report for", c("Acme Corp", "Beta Inc"))
[1] "Report forAcme Corp" "Report forBeta Inc" 
str_c("Report for", c("Acme Corp", "Beta Inc"), sep = " ") # Use separator
[1] "Report for Acme Corp" "Report for Beta Inc" 

glue strings

str_glue() improves readability by allowing embedded expressions within {}.

company <- "Microsoft Inc."
str_glue("Earnings: {company} reported strong results.")
Earnings: Microsoft Inc. reported strong results.

It is also vectorized, with recycling.

companies <- c("Apple Inc.","Microsoft Inc.")
str_glue("Earnings: {companies} reported strong results.") # vectorized
Earnings: Apple Inc. reported strong results.
Earnings: Microsoft Inc. reported strong results.

flatten strings

If you want to collapse a vector of strings into a single string, use str_flatten() or paste() with the collapse argument.

forecast <- c("There", "will", "be", "a", "strong", "market", "volatility.")
str_flatten(forecast)
[1] "Therewillbeastrongmarketvolatility."
str_flatten(forecast, collapse = " ") # space between collapsing
[1] "There will be a strong market volatility."

Base R: paste() and collapse.

paste(forecast, collapse = " ")
[1] "There will be a strong market volatility."

Exercise

  1. Check the length of the vector and the length of each string in:
companies <- c("Alphabet Corp", "Beta Inc", "Gamma LLC")
  2. Flatten the above companies character vector into a scalar string.

  3. Fix the code below so that the embedded expression {companies} is evaluated, then print:

announcement <- "The {companies} is performing well."
print(announcement) # "companies" is not evaluated. How to fix it?
[1] "The {companies} is performing well."

Letters in Strings

Two different concepts are related to length:

  • the number of elements in a vector: length()
  • the number of characters in each element: str_length() or nchar()
example_string <- c("risk free rates", "earnings announcements")
length(example_string)
[1] 2
str_length(example_string)
[1] 15 22
nchar(example_string)
[1] 15 22

Subsetting Letters

You can extract parts of a string by position with str_sub().

companies <- c("Apple Inc.", "Sackson LLC", "Zeta Investments")
str_sub(companies, 1, 2)
[1] "Ap" "Sa" "Ze"
str_sub(companies, -1) # last letter
[1] "." "C" "s"
str_sub(companies, 5, -1) # 5th to last
[1] "e Inc."       "son LLC"      " Investments"
str_sub(companies, -5, -1) # fifth-from-last to last
[1] " Inc." "n LLC" "ments"

Pad strings

str_pad() pads a string to fixed length by adding extra whitespace on the left, right or both.

x <- c("Apple", "Microsoft")
str_pad(x, 10) # pad to 10 characters by adding whitespace on the left
[1] "     Apple" " Microsoft"
str_pad(x, 10, side = "right")
[1] "Apple     " "Microsoft "
str_pad(x, 10, side = "both")
[1] "  Apple   " "Microsoft "

You can also pad with other characters, for example, leading zeros:

number_string <- c("1","23","359")
str_pad(number_string, width = 3, pad = "0") # 3 character with leading zeros
[1] "001" "023" "359"

Lettercases

Upper / lowercase transformations:

x <- "Economists say that it can cause stagflation."
str_to_upper(x) # uppercase
[1] "ECONOMISTS SAY THAT IT CAN CAUSE STAGFLATION."
str_to_lower(x) # lowercase
[1] "economists say that it can cause stagflation."
str_to_title(x) # Title cases
[1] "Economists Say That It Can Cause Stagflation."

Exercises

companies <- c("Apple Inc.", "Sackson LLC", "Zeta Investments")
  1. Extract the first two characters from companies.

  2. Extract the last two characters.

  3. Transform to lowercase and uppercase.

number_string <- c("1","23","359")
  4. Pad “0” to the left so that each string has 4 characters (e.g., “0001”).

Regular Expressions

RegEx

Regular Expressions (Regex) is a language for describing “patterns” within strings.

  • Regex is a core tool for working with text data
  • Widely supported in stringr, tidyverse, and base R

Prep

We’ll use regular expression functions from the stringr and tidyr packages, both core members of the tidyverse.

# install.packages("newsanchor")
library(tidyverse)
library(newsanchor) # For financial news data

Datasets and Examples

To explore regular expressions, we’ll use:

Three character vectors from the stringr package:

  • fruit: names of 80 fruits
  • words: 980 common English words
  • sentences: 720 short example sentences

These built-in datasets are great for testing regex.

Pattern Basics

str_view() highlights matches in a string vector using <>.

Literal characters match exactly:

str_view(fruit, "berry")
 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

Metacharacters and Wildcards

Some characters, like ., +, and *, have special meanings in regex and are known as metacharacters.

.: A wildcard that matches any single character. For example:

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>

Pattern Length with Wildcards

You can match specific lengths of text using . repeated:

str_view(fruit, "a...e")
 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry

This matches an “a” followed by any three characters and an “e”.

Quantifiers

Quantifiers control how often a pattern appears:

  • ?: 0 or 1 time (optional)
str_view(c("a", "ab", "abb", "abbb", "abc"), "ab?") # ? applied to "b"
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
[4] │ <ab>bb
[5] │ <ab>c
  • +: 1 or more times
str_view(c("a", "ab", "abb", "abbb", "abc"), "ab+") # + applied to "b"
[2] │ <ab>
[3] │ <abb>
[4] │ <abbb>
[5] │ <ab>c
  • *: 0 or more times
str_view(c("a", "ab", "abb", "abbb", "abc"), "ab*") # * applied to "b"
[1] │ <a>
[2] │ <ab>
[3] │ <abb>
[4] │ <abbb>
[5] │ <ab>c

Exercise

my_words <- c("ca", "oca", "caan", "cat", "call", "candy")
  1. Use str_view() to highlight pattern “ca”

  2. Use str_view() to highlight pattern “ca” and following exactly one character (hint: .)

Character Set

Use brackets [] to define and match a set of characters, also called a character class. For example, [aeiou] matches any vowel.

str_view(words, "[aeiou]x[aeiou]") # Words containing x surrounded by vowels
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
str_view(words, "[^aeiou]y[^aeiou]") # Words containing y surrounded by consonants
[836] │ <sys>tem
[901] │ <typ>e

The caret ^ inside brackets negates the set.

Caution

The caret ^ outside of brackets has a different meaning: it anchors the match to the beginning of the string.

Alternation (OR)

Use | to match one of several patterns:

str_view(fruit, "apple|melon|nut")
 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
str_view(fruit, "aa|ee|ii|oo|uu")
 [9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n

This finds fruits containing specified keywords or repeated vowels.

Exercise

another_words <- c("taxi", "flux", "pixie", "axial", "exude")
  1. Use str_view() to highlight the pattern where “x” is surrounded by vowels

  2. Use str_view() to match words containing any of “flux” or “pixie”

Key functions

str_view() is good for experimenting with pattern matching. Other key functions are:

  • str_detect(): logical check of whether a pattern exists
  • str_subset(): keep the elements that contain a pattern
  • str_count(): count the occurrences of a pattern
  • str_replace(): replace matched patterns
  • separate_...(): separate by patterns

Use case: Detect Matches

In real data, you can use str_detect() to check for the presence of a pattern.

  • It returns a logical vector; ideal for filtering
str_detect(c("a", "b", "c"), "[aeiou]")
[1]  TRUE FALSE FALSE
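Because str_detect() returns a logical vector, it slots directly into filter(). A minimal sketch with hypothetical headlines (made up for illustration):

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical headlines, made up for illustration
news <- tibble(
  headline = c(
    "Fed holds rates steady",
    "Tesla earnings beat estimates",
    "Oil prices slide",
    "Apple earnings miss forecasts"
  )
)

# Keep only rows whose headline contains "earnings"
earnings_news <- news |>
  filter(str_detect(headline, "earnings"))

earnings_news
```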

str_subset() and str_which()

Two other useful functions are:

  • str_subset(): returns the elements that contain the pattern
  • str_which(): returns the integer positions of the elements that contain the pattern

Example:

str_subset(sentences, "is") |> head()
[1] "These days a chicken leg is a rare dish."         
[2] "Rice is often served in round bowls."             
[3] "A large size in stockings is hard to sell."       
[4] "A rod is used to catch pink salmon."              
[5] "The source of the huge river is the clear spring."
[6] "The fish twisted and turned on the bent hook."    
str_which(sentences, "is") |> head()
[1]  4  5 10 12 13 22
# sentences[str_which(sentences, "is")] 

You can use these to extract or locate matches without altering the original data structure.

Count Matches with str_count()

Count the number of matches in each string:

str_count("Dogecoin is showing signs of strength that is related", "is")
[1] 2

Use str_view() to check which matches were counted:

str_view("Dogecoin is showing signs of strength that is related", "is")
[1] │ Dogecoin <is> showing signs of strength that <is> related

Case Sensitivity in Regex

Sometimes your results may look off. For example, the name “Aaban” has three “a”s, but only two are counted. That’s because regex is case sensitive by default.

You can fix this in three ways:

  • Add uppercase characters to the pattern:
name <- "Aaban"
str_view(name, "[aeiouAEIOU]")
[1] │ <A><a>b<a>n
  • Use regex(..., ignore_case = TRUE):
str_view(name, regex("[aeiou]", ignore_case = TRUE))
[1] │ <A><a>b<a>n
  • Preprocess the string to lowercase:
str_view(str_to_lower(name), "[aeiou]")
[1] │ <a><a>b<a>n

Replace Values

str_replace() replaces the first match. str_replace_all() replaces all matches.

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
[1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
[1] "-ppl-"  "p--r"   "b-n-n-"
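A common finance use is cleaning formatted numbers before converting them. The sketch below uses hypothetical price strings; the dollar sign is escaped inside the character set to be safe.

```r
library(stringr)

# Hypothetical price strings as they might appear in a raw report
prices_raw <- c("$1,234.50", "$987.00", "$12,000")

# [\\$,] is a character set: a literal dollar sign or a comma
prices_clean <- str_replace_all(prices_raw, "[\\$,]", "")
prices <- as.numeric(prices_clean)

prices
```

str_replace() alone would not be enough here, since it would remove only the first symbol in each string.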

Extract and Separate Variables

In a tibble (data frame), you can separate text into variables by

  • delimiter,
  • position and
  • pattern (regex)

Separate by delimiter

separate_longer_delim() separates values into long form.

df1 <- tibble(x = c("a,b,c", "d,e", "f"))
df1 |> 
  separate_longer_delim(x, delim = ",")
# A tibble: 6 × 1
  x    
  <chr>
1 a    
2 b    
3 c    
4 d    
5 e    
6 f    

separate_wider_delim() separates values into wide form. You must specify names, and what to do when there are too few or too many pieces.

df1 |> 
  separate_wider_delim(
    x, 
    delim = ",", 
    names = c("first","second","third"),
    too_few = "align_start")
# A tibble: 3 × 3
  first second third
  <chr> <chr>  <chr>
1 a     b      c    
2 d     e      <NA> 
3 f     <NA>   <NA> 
df1 |> 
  separate_wider_delim(
    x, 
    delim = ",", 
    names = c("first","second"),
    too_few = "align_start", 
    too_many = "debug")
# A tibble: 3 × 6
  first second x     x_ok  x_pieces x_remainder
  <chr> <chr>  <chr> <lgl>    <int> <chr>      
1 a     b      a,b,c FALSE        3 ",c"       
2 d     e      d,e   TRUE         2 ""         
3 f     <NA>   f     TRUE         1 ""         

Separate by position

separate_longer_position() splits by fixed width. You must specify width.

df2 <- tibble(x = c("1211", "131", "21"))
df2 |> 
  separate_longer_position(x, width = 2)
# A tibble: 5 × 1
  x    
  <chr>
1 12   
2 11   
3 13   
4 1    
5 21   

separate_wider_position() separates values into wide form.

You must specify widths with a named integer vector, and what to do when there are too few or too many characters.

df3 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
df3 |> 
  separate_wider_position(
    x, 
    widths = c(year = 4, age = 2, state = 2))
# A tibble: 3 × 3
  year  age   state
  <chr> <chr> <chr>
1 2022  15    TX   
2 2021  22    LA   
3 2023  25    CA   

Separate by regex

Use this when you want to separate by regex patterns. Below is a more complex example:

df <- tribble(
  ~string,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Brandon>-M_33",
  "<Sharon>-F_38"
)
print(df)
# A tibble: 4 × 1
  string        
  <chr>         
1 <Sheryl>-F_34 
2 <Kisha>-F_45  
3 <Brandon>-M_33
4 <Sharon>-F_38 

Use separate_wider_regex() to extract structured data:

df |> 
  separate_wider_regex(
    string,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  )
# A tibble: 4 × 3
  name    gender age  
  <chr>   <chr>  <chr>
1 Sheryl  F      34   
2 Kisha   F      45   
3 Brandon M      33   
4 Sharon  F      38   

Exercise

test_strings <- c("apple", "banana", "cherry", "date", "fig", "grape123")
  1. Use str_detect() to indicate whether each string contains a digit (hint: [0-9]).
  2. Use str_count() to count the number of vowels.
  3. Use str_replace_all() to replace “a” with “e”.
  4. Use a separate_wider_*() function to separate fruit into two variables by whitespace.
  • Generate a tibble from fruit first

Escaping

To match metacharacters (., ?, *) literally in regex, escape them with a backslash \.

  • To match . literally, the regex pattern should be \.
  • Regex patterns are written as strings
  • However, strings also use \ as their escape character, so \ itself must be written as \\
  • You should therefore write "\\." to express \.
dot <- "\\."
str_view(dot) # Literally \.
[1] │ \.
str_view(c("abc", "a.c", "bef"), pattern = "a\\.c")
[2] │ <a.c>

To match ?, you need the regex \?, which is written as the string "\\?".

str_view(c("Is this crazy?"), "\\?")
[1] │ Is this crazy<?>

To match \, you need the regex \\, which is written as the string "\\\\".

str_view(c("\\"), "\\\\") # wow!
[1] │ <\>

Writing the pattern as a raw string removes one level of escaping.

str_view(c("Is this crazy?"), r"{\?}")
[1] │ Is this crazy<?>
str_view(c(r"(\)"), r"[\\]") 
[1] │ <\>

Alternatively, for some (not all) metacharacters, you can put them inside a character set [] to match them literally.

  • A backslash \ is still special inside a character set and must be escaped
str_view(c("a*c", "a?c"), "[*?]")
[1] │ a<*>c
[2] │ a<?>c
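Forgetting to escape a dot is an easy mistake to miss, because the unescaped pattern still matches the intended string. A small sketch with a hypothetical ticker list:

```r
library(stringr)

# Hypothetical ticker list
tickers <- c("BRK.B", "BRKAB", "BF.B", "AAPL")

# Unescaped: . is a wildcard, so "BRKAB" also matches
unescaped <- str_detect(tickers, "BRK.B")

# Escaped: only a literal dot matches
escaped <- str_detect(tickers, "BRK\\.B")

# Raw-string form of the same escaped pattern
escaped_raw <- str_detect(tickers, r"(BRK\.B)")

unescaped
escaped
```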

Anchors

If you want to match at the start or end of a string, you need anchors:

  • ^ matches the start
  • $ matches the end
  • \b matches a boundary between words
# Match the start
str_view(c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)"), "^sum")
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[4] │ <sum>(x)

Word boundary example:

# Match word boundary
str_view(c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)"), "\\bsum\\b")
[4] │ <sum>(x)

Match end:

str_view(c("surprising", "rising sun"), "rising$")
[1] │ surp<rising>

Used alone, anchors produce zero-width matches:

str_view(c("apple banana monkey boom") , pattern = c("^", "$", "\\b"))
[1] │ <>apple banana monkey boom
[2] │ apple banana monkey boom<>
[3] │ <>apple<> <>banana<> <>monkey<> <>boom<>

You can use this feature for replacements:

str_replace_all(c("apple banana monkey boom") , pattern = c("^", "$", "\\b"), "--")
[1] "--apple banana monkey boom"              
[2] "apple banana monkey boom--"              
[3] "--apple-- --banana-- --monkey-- --boom--"
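Combining both anchors forces the pattern to describe the whole string. A brief sketch with hypothetical currency codes:

```r
library(stringr)

# Hypothetical inputs
x <- c("USD", "USDT", "EURUSD", "usd")

anywhere <- str_detect(x, "USD")    # anywhere in the string
at_start <- str_detect(x, "^USD")   # at the start
at_end   <- str_detect(x, "USD$")   # at the end
whole    <- str_detect(x, "^USD$")  # the whole string is exactly "USD"

whole
```

Note that "usd" never matches, since regex is case sensitive by default.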

Character Set

Use [] to match any character from a set.

  • e.g., [abc]
  • [^abc] to exclude
  • [a-z] - defines range
  • \ escapes special characters within []
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "[abc]+")
[1] │ <abc>d ABCD 12345 -!@#%.
str_view(x, "[a-z]+")
[1] │ <abcd> ABCD 12345 -!@#%.
str_view(x, "[^a-z0-9]+") # Note whitespace matching
[1] │ abcd< ABCD >12345< -!@#%.>
# You need an escape to match characters that are otherwise special inside [], such as -
str_view("a-b-c", "[a-c]") # a to c
[1] │ <a>-<b>-<c>
str_view("a-b-c", "[a\\-c]") # a, -, c
[1] │ <a><->b<-><c>

Character Set Shortcuts

Some character sets are so common that they have shortcuts:

  • \d any digit
  • \D anything not digit
  • \s any whitespace (space, tab, newline)
  • \S anything not whitespace
  • \w any word character (letters, digits, and underscore)
  • \W any non-word character
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+" )  # digits
[1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+" )  # non-digits
[1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+" )  # space
[1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+" )  # non-space
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+" )  # word
[1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+" )  # non-word
[1] │ abcd< >ABCD< >12345< -!@#%.>

Quantifiers

In addition to

  • ? (0 or 1)
  • + (1 or more)
  • * (0 or more)

you can specify precise quantifiers:

  • {n} exactly n times
  • {n,} at least n times
  • {n,m} between n and m times
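A short sketch of the precise quantifiers, applied to hypothetical digit strings. The anchors ^ and $ make each pattern describe the whole string.

```r
library(stringr)

# Hypothetical digit strings
x <- c("7", "42", "2025", "123456")

exactly_four <- str_detect(x, "^\\d{4}$")    # exactly four digits
at_least_two <- str_detect(x, "^\\d{2,}$")   # at least two digits
one_or_two   <- str_detect(x, "^\\d{1,2}$")  # between one and two digits

exactly_four
at_least_two
one_or_two
```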

Operator Precedence

Regular expressions follow precedence rules like math:

  • Quantifiers (+, ?): high
  • Alternation (|): low

You can use () to specify precedence and grouping.

str_view(c("ab", "abab", "abb", "abbb"), "ab+")  # means a(b+)
[1] │ <ab>
[2] │ <ab><ab>
[3] │ <abb>
[4] │ <abbb>
str_view(c("apple", "banana"), "^a|b$")   # means (^a)|(b$)
[1] │ <a>pple
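To see how parentheses change the meaning, compare the same alternation with and without grouping. The input vector is hypothetical.

```r
library(stringr)

# Hypothetical inputs
x <- c("a", "b", "apple", "crab")

# Without parentheses: starts with "a" OR ends with "b"
loose <- str_detect(x, "^a|b$")

# With parentheses: the whole string is "a" or "b"
strict <- str_detect(x, "^(a|b)$")

loose
strict
```

The ungrouped pattern matches all four strings, while the grouped one matches only the single-letter entries.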

Grouping

Parentheses () can also be used to create capturing groups.

Use \1, \2, etc., to refer back to matched groups.

str_view(fruit, "(..)\\1") # repeated pair of letters
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
str_view(words, "^(..).*\\1$") # start and end with same pair of letters
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>

Use back references to replace:

sentences |> 
  head(3) |> 
  as_tibble() |> 
  mutate(new = str_replace(value, "(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2"))
# A tibble: 3 × 2
  value                                       new                               
  <chr>                                       <chr>                             
1 The birch canoe slid on the smooth planks.  The canoe birch slid on the smoot…
2 Glue the sheet to the dark blue background. Glue sheet the to the dark blue b…
3 It's easy to tell the depth of a well.      It's to easy tell the depth of a …

You can extract matches with str_match(), which returns a matrix:

sentences |> 
  head(3) |> 
  str_match("the (\\w+) (\\w+)")
     [,1]                [,2]     [,3]    
[1,] "the smooth planks" "smooth" "planks"
[2,] "the sheet to"      "sheet"  "to"    
[3,] "the depth of"      "depth"  "of"    

Convert to tibble:

sentences |> 
  head(3) |> 
  str_match("the (\\w+) (\\w+)") |> 
  as_tibble(.name_repair = "minimal") |> 
  set_names("match", "word1", "word2")
# A tibble: 3 × 3
  match             word1  word2 
  <chr>             <chr>  <chr> 
1 the smooth planks smooth planks
2 the sheet to      sheet  to    
3 the depth of      depth  of    

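Or use separate_wider_regex(). Note that the leading .* is greedy, so row 2 captures the words after the last “the” rather than the first: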

sentences |> 
  head(3) |> 
  as_tibble() |> 
  separate_wider_regex(
    value, 
    c(".*", "the", " ", word1 = "\\w+", " ", word2 = "\\w+", ".*") # the patterns must describe the whole string
  )
# A tibble: 3 × 2
  word1  word2 
  <chr>  <chr> 
1 smooth planks
2 dark   blue  
3 depth  of    

When you want to use () purely for grouping, not for capturing:

  • (?:...) is a non-capturing group
x <- c("a gray cat", "a grey dog")
str_match(x, "gr(e|a)y") # grey, gray
     [,1]   [,2]
[1,] "gray" "a" 
[2,] "grey" "e" 
str_match(x, "gr(?:e|a)y")
     [,1]  
[1,] "gray"
[2,] "grey"
str_match(x, "gre|ay") # gre, ay
     [,1] 
[1,] "ay" 
[2,] "gre"

Exercise

  1. How would you match the literal string "'\ ? How about "$^$"?

  2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

  • Start with “y”.
  • Don’t start with “y”.
  • End with “x”.
  • Are exactly three letters long. (Don’t cheat by using str_length()!)
  • Have seven letters or more.
  • Contain a vowel-consonant pair.
  • Contain at least two vowel-consonant pairs in a row.
  • Only consist of repeated vowel-consonant pairs.
  3. Switch the first and last letters in words. Which of those strings are still words?

  4. Describe in words what these regular expressions match. Read carefully to see if each entry is a regular expression or a string that defines a regular expression.

  1. ^.*$
  2. "\\{.+\\}"
  3. \d{4}-\d{2}-\d{2}
  4. "\\\\{4}"
  5. \..\..\..
  6. (.)\1\1
  7. "(..)\\1"

Pattern Control

regex() gives more control over the pattern object, using flags.

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
[1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE)) # ignore_case flag
[1] │ <banana>
[2] │ <Banana>
[3] │ <BANANA>

The dotall flag allows . to match any character, including \n.

x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")
str_view(x, regex(".Line", dotall = TRUE)) # dotall flag: . matches even \n
[1] │ Line 1<
    │ Line> 2<
    │ Line> 3

The multiline flag makes ^ and $ match the start and end of each line.

x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
[1] │ <Line> 1
    │ Line 2
    │ Line 3
str_view(x, regex("^Line", multiline = TRUE))
[1] │ <Line> 1
    │ <Line> 2
    │ <Line> 3

The comments flag lets you annotate complex patterns with comments and whitespace.

phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code capturing group
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three numbers group
    [\ -]?  # optional space or dash
    (\d{4}) # four digits group
  )", 
  comments = TRUE
)

str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
[1] "514-791-8141"   "(123) 456 7890" NA              

Fixed matches

Opt out of regular expression rules using fixed().

str_view(c("", "a", "."), fixed(".")) # . is literally a dot, not a regex metacharacter
[3] │ <.>
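
To see the difference side by side, here is a small sketch that runs the same input through a regex, through fixed(), and through an escaped regex:

```r
library(stringr)

x <- c("", "a", ".")

# As a regex, "." is a metacharacter that matches any single character
as_regex <- str_detect(x, ".")
# With fixed(), "." is compared literally
as_fixed <- str_detect(x, fixed("."))
# Escaping the dot in a regex gives the same result as fixed()
as_escaped <- str_detect(x, "\\.")

as_regex
as_fixed
as_escaped
```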

Examples

  1. Find all sentences that start with “The”.
str_view(sentences, "^The") |> head() # Not correct because of "These"
 [1] │ <The> birch canoe slid on the smooth planks.
 [4] │ <The>se days a chicken leg is a rare dish.
 [6] │ <The> juice of lemons makes fine punch.
 [7] │ <The> box was thrown beside the parked truck.
 [8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
str_view(sentences, "^The\\b") |> head()
 [1] │ <The> birch canoe slid on the smooth planks.
 [6] │ <The> juice of lemons makes fine punch.
 [7] │ <The> box was thrown beside the parked truck.
 [8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
[13] │ <The> source of the huge river is the clear spring.
  2. Find all sentences that begin with a pronoun.
str_view(sentences, "^He|She|It|They\\b") |> head() # Fail: "Help", "Her"
 [3] │ <It>'s easy to tell the depth of a well.
[15] │ <He>lp the woman get back to her feet.
[27] │ <He>r purse was full of useless trash.
[29] │ <It> snowed, rained, and hailed the same morning.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
# Without grouping, | splits the whole pattern: ^He, She, It, or They\\b
str_view(sentences, "^(He|She|It|They)\\b") |> head() # ^ (He, She, It, They)
  [3] │ <It>'s easy to tell the depth of a well.
 [29] │ <It> snowed, rained, and hailed the same morning.
 [63] │ <He> ran half way to the hardware store.
 [90] │ <He> lay prone and hardly moved a limb.
[116] │ <He> ordered peach pie with ice cream.
[127] │ <It> caught its hind paw in a rusty trap.

Best Practices

How do you spot such mistakes? Create a few positive and negative examples and test the pattern against them.

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")

pattern <- "^(She|He|It|They)\\b"
str_detect(pos, pattern)
[1] TRUE TRUE
str_detect(neg, pattern)
[1] FALSE FALSE
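
The check can be wrapped in a small helper so that every candidate pattern is tested the same way. This is only a sketch; passes() is a hypothetical name, not a stringr function:

```r
library(stringr)

pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")

# Hypothetical helper: TRUE only if every positive example matches
# and no negative example does
passes <- function(pattern) {
  all(str_detect(pos, pattern)) && !any(str_detect(neg, pattern))
}

flawed_ok  <- passes("^He|She|It|They\\b")   # ungrouped pattern from the example
grouped_ok <- passes("^(She|He|It|They)\\b") # grouped pattern

c(flawed_ok, grouped_ok)
```

The ungrouped pattern fails because "She" and "It" are allowed to match anywhere in the negative examples.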

Create pattern with code

What if you wanted to find all sentences that mention a color?

str_view(sentences, "\\b(red|green|blue)\\b") |> head()
  [2] │ Glue the sheet to the dark <blue> background.
 [26] │ Two <blue> fish swam in the tank.
 [92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
[174] │ The sky that morning was clear and bright <blue>.

What if there are many colors and they are stored in data, like:

colors() |> head(10)

First, drop the color names that contain numbers:

cols <- colors() # copy object
cols <- str_subset(cols, "\\d", negate = TRUE)
cols |> head(10)
 [1] "white"          "aliceblue"      "antiquewhite"   "aquamarine"    
 [5] "azure"          "beige"          "bisque"         "black"         
 [9] "blanchedalmond" "blue"          

Now you can generate patterns using R code:

str_c("\\b(", str_flatten(cols, collapse = "|"), ")\\b")
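
The generated string can then be used like any hand-written pattern. A minimal sketch that rebuilds the pattern and applies it to stringr's sentences (the object name color_sentences is only for illustration):

```r
library(stringr)

# Keep only color names without digits, as in the step above
cols <- str_subset(colors(), "\\d", negate = TRUE)

# Join the names with | and wrap them in word boundaries
pattern <- str_c("\\b(", str_flatten(cols, collapse = "|"), ")\\b")

# Keep the sentences that mention at least one color name
color_sentences <- str_subset(sentences, pattern)
head(color_sentences)
```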

Application: Financial News

To fetch news data, you’ll need an API key from

https://newsapi.org

library(newsanchor)
news_data <- get_everything(
  "financial markets",
  language = "en",
  page = 1,
  # api_key = your_api_key,
)

News data prep

The data frame is stored in the second element of the returned list.

news_frame <- news_data[[2]] |> as_tibble()
news_frame |> head()
# A tibble: 6 × 9
  author  title description url   url_to_image published_at        content id   
  <chr>   <chr> <chr>       <chr> <chr>        <dttm>              <chr>   <chr>
1 Axal S… Ital… "Italy Ban… http… https://med… 2026-02-07 20:20:35 "&lt;d… <NA> 
2 Waqas   Bith… "A system … http… https://hac… 2026-02-07 20:08:59 "On 6 … <NA> 
3 Oluwap… Trum… "The crypt… http… https://cry… 2026-02-07 20:05:46 "The c… <NA> 
4 Editor  Inte… "Podcast: … http… https://www… 2026-02-07 20:00:00 "Podca… <NA> 
5 Diana … How … "By 2050, … http… https://www… 2026-02-07 20:00:00 "Young… <NA> 
6 Editor… Mike… "Stablecoi… http… https://sta… 2026-02-07 19:35:52 "Stabl… <NA> 
# ℹ 1 more variable: name <chr>

Application: Financial News

To filter financial news that mention “uncertain”:

news_frame |> 
  filter(str_detect(str_to_lower(description), "uncertain")) |> 
  select(author, title, description) |> 
  head()
# A tibble: 0 × 3
# ℹ 3 variables: author <chr>, title <chr>, description <chr>

Filter news that mention “uncertain” or “risk” or “option” or “down”:

news_frame |> 
  filter(str_detect(str_to_lower(description), "uncertain|risk|option|down")) |> 
  select(author, title, description)
# A tibble: 12 × 3
   author                  title                                     description
   <chr>                   <chr>                                     <chr>      
 1 Editor                  Interview 1999 – Gold Rush as Dollar Cra… "Podcast: …
 2 Kurt Zindulka           Half of British Voters Want Prime Minist… "Half of B…
 3 Glenn Carle             FO Exclusive: Global Lightning Roundup o… "Editor-in…
 4 Bloomberg News          Charting the Global Economy: ECB Holds, … "The Europ…
 5 Juliana Kim             DVDs and public transit: Boycott drives … "A sweepin…
 6 Rafael Nam              Trump promised a crypto revolution. So w… "Trump got…
 7 The White Coat Investor 13 Reasons I Still Own Bonds              "For some …
 8 Reuters                 Iran's surging crypto activity draws US … "Crypto us…
 9 Jake Simmons            Kevin Warsh Will Trigger Bitcoin Regime … "Bitcoin’s…
10 James Halver            Mining Stocks And Asian Markets Hit As B… "Bitcoin’s…
11 Everygame Casino        Super Bowl Betting Promos: Everygame's L… "Everygame…
12 Bovada                  Super Bowl Betting Sites: Bovada's Welco… "An inform…
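
Note that a bare keyword also matches inside longer words: "option" is found in "adoption" and "down" in "download". As a sketch on made-up descriptions (toy_news is hypothetical data, not the API result), word boundaries make the filter stricter:

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical descriptions, made up for illustration only
toy_news <- tibble(
  description = c(
    "Crypto adoption keeps rising",
    "Traders price in downside risk",
    "Download the full report here",
    "Options volume hit a high"
  )
)

checked <- toy_news |>
  mutate(
    # Loose: keyword may appear anywhere, even inside another word
    loose  = str_detect(str_to_lower(description), "risk|option|down"),
    # Strict: keyword must be a whole word (plural "options" allowed)
    strict = str_detect(str_to_lower(description), "\\b(risk|options?|down)\\b")
  )
checked
```

Whether the loose or the strict version is appropriate depends on the keywords; stems such as "uncertain" are often meant to match "uncertainty" as well.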

You can compute a simple sentiment polarity score with lexicon matching:

positive_words <- c("gain", "rally", "beat", "surge", "growth", "record", "optimism", "strong")
negative_words <- c("loss", "fall", "miss", "drop", "decline", "weak", "concern", "crisis")

news_frame |> 
  mutate(
    text = str_to_lower(title),
    pos = str_count(text, str_c("\\b(", str_c(positive_words, collapse = "|"), ")\\b")),
    neg = str_count(text, str_c("\\b(", str_c(negative_words, collapse = "|"), ")\\b")),
    sentiment_score = pos - neg
  )
# A tibble: 96 × 13
   author title description url   url_to_image published_at        content id   
   <chr>  <chr> <chr>       <chr> <chr>        <dttm>              <chr>   <chr>
 1 Axal … Ital… "Italy Ban… http… https://med… 2026-02-07 20:20:35 "&lt;d… <NA> 
 2 Waqas  Bith… "A system … http… https://hac… 2026-02-07 20:08:59 "On 6 … <NA> 
 3 Oluwa… Trum… "The crypt… http… https://cry… 2026-02-07 20:05:46 "The c… <NA> 
 4 Editor Inte… "Podcast: … http… https://www… 2026-02-07 20:00:00 "Podca… <NA> 
 5 Diana… How … "By 2050, … http… https://www… 2026-02-07 20:00:00 "Young… <NA> 
 6 Edito… Mike… "Stablecoi… http… https://sta… 2026-02-07 19:35:52 "Stabl… <NA> 
 7 Garet… As t… "As the We… http… https://liv… 2026-02-07 19:31:07 "Is Au… abc-…
 8 Quent… Afte… "The specu… http… https://s.y… 2026-02-07 19:30:00 "Galax… <NA> 
 9 Bloom… Tech… "The bigge… http… https://sma… 2026-02-07 19:29:26 "(Bloo… fina…
10 Joe W… Prof… "\"Sharp b… http… https://fut… 2026-02-07 19:15:00 "Follo… <NA> 
# ℹ 86 more rows
# ℹ 5 more variables: name <chr>, text <chr>, pos <int>, neg <int>,
#   sentiment_score <int>
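
Since the scores are truncated in the printout above, here is the same calculation on three made-up headlines (hypothetical text, not taken from the API), where the counts can be checked by hand:

```r
library(stringr)

positive_words <- c("gain", "rally", "beat", "surge", "growth", "record", "optimism", "strong")
negative_words <- c("loss", "fall", "miss", "drop", "decline", "weak", "concern", "crisis")

# One alternation pattern per word list, wrapped in word boundaries
pos_pattern <- str_c("\\b(", str_flatten(positive_words, collapse = "|"), ")\\b")
neg_pattern <- str_c("\\b(", str_flatten(negative_words, collapse = "|"), ")\\b")

# Hypothetical headlines, lower-cased like the text column above
titles <- str_to_lower(c(
  "Stocks rally to record high on strong earnings",
  "Shares drop amid growth concern",
  "Markets close flat"
))

# Positive hits minus negative hits per headline
sentiment_score <- str_count(titles, pos_pattern) - str_count(titles, neg_pattern)
sentiment_score
```

Because of the \\b boundaries, inflected forms such as "gains" or "falls" are not counted; a fuller lexicon would need to list them or allow suffixes.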

Exercises

  1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

    • Find all words that start or end with x.
    • Find all words that start with a vowel and end with a consonant.
    • Are there any words that contain at least one of each different vowel?
  2. colors() contains a number of modifiers like “light”, “dark”, “medium” as in “lightgray” and “darkblue”. How could you automatically identify these modifiers?

  • Think about how you might detect and then remove the colors that are modified.

Large Language Models

LLMs in Finance

The use of LLMs in financial data analysis can be very effective.

  • Financial news and transcripts summary
  • Cleaning unstructured filings
  • Sentiment, and Q&A
  • Code assistant

LLM Deployments in R

Cloud LLMs

  • ellmer supports cloud LLM backends
  • Requires API key variables (e.g., OPENAI_API_KEY)
  • Pros: Zero setup, scalable
  • Cons: Costs, privacy risks

Local LLMs

  • ollamar and mall packages
  • ellmer also supports local
  • Pros: No data leakage, offline
  • Cons: Hardware limits, setup

Package: ellmer

ellmer connects to cloud/local LLMs.

  • Supports multiple providers
    • OpenAI, Gemini, Claude, Groq, Ollama (local), DeepSeek, Perplexity, …
# install.packages("ellmer")
library(ellmer)
library(tidyverse)

Prep: Google Gemini

Google Gemini provides a free-tier API.

  1. Visit https://aistudio.google.com/

  2. Get the API key

  3. Set the chat machine

chat <- chat_google_gemini(api_key = "YOUR_API_KEY")
  4. Test the chat machine:
chat$chat("Hi!")
Hello! How can I help you today?
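
To avoid typing the key into a script, it can be read from an environment variable. This is a sketch only: GOOGLE_API_KEY is an assumed variable name that you would set yourself (for example in ~/.Renviron), and the chat is created only when the variable is present.

```r
# Read the key from an environment variable instead of hard-coding it.
# GOOGLE_API_KEY is an assumed name; set it in ~/.Renviron, e.g.
#   GOOGLE_API_KEY=your_key_here
api_key <- Sys.getenv("GOOGLE_API_KEY")

if (nzchar(api_key)) {
  # Same call as above, with the key taken from the environment
  chat <- ellmer::chat_google_gemini(api_key = api_key)
} else {
  message("GOOGLE_API_KEY is not set; skipping chat setup.")
}
```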

Interactive Console

You can chat interactively with live_console() or live_browser().

Try it yourself:

live_console(chat)

Prompt Engineering

Core Principles for effective use of LLMs:

  1. Give the Role & Context

  2. Source Delimiters

  3. Explicit Output Format

  4. Determinism Controls

Principle 1: Role + Context

Define the LLM’s role and the financial context.

  • Aligns responses with domain expertise.

Examples

  • Specify role: “You are a sell-side equity analyst.”
  • Add context: “Focus on tech stocks in a bullish market.”

Principle 2: Source Delimiters

Wrap input text (news, filings) in triple backticks to clarify input boundaries.

Examples

  • In prompt: “Analyze the news ```{news_frame$content[[1]]}```”
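# Caution: a plain string is sent as-is, so {news_frame$content[1]} is not filled in.
# Wrap the template in str_glue() to insert the article text.
prompt <- "Analyze sentiment: positive, negative, neutral:\n```{news_frame$content[1]}```"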
chat$chat(prompt)
I'm sorry, I cannot access the content of `news_frame$content[1]` directly. As 
an AI model, I don't have the capability to execute code or access external 
dataframes from your local environment.

Could you please provide the actual text you'd like me to analyze? Once you 
paste the text, I'll be happy to give you a sentiment analysis!
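
The reply above shows that the placeholder reached the model unchanged: a plain R string is never interpolated. A minimal sketch of the fix with str_glue(), using a made-up article text in place of news_frame$content[[1]] and building the delimiter in code:

```r
library(stringr)

# Hypothetical article text standing in for news_frame$content[[1]]
article <- "Stocks rallied after the central bank held rates steady."

# Build the triple-backtick delimiter in code to keep the template readable
fence <- strrep("`", 3)

# str_glue() replaces {fence} and {article} with their values
prompt <- str_glue(
  "Analyze sentiment: positive, negative, neutral:\n{fence}{article}{fence}"
)
prompt
```

The resulting prompt can be passed to chat$chat() in the same way as before.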

Principle 3: Explicit Output Format

Request structured outputs for your analysis. When multiple answers are expected, JSON format is recommended.

Examples

  • “Strictly answer Yes or No.”
  • “Return JSON: {sentiment: value, confidence: 0-1}”
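
A structured format also makes replies easy to check in code. As a sketch with two made-up replies (not real model output), jsonlite::validate() tells whether a string is valid JSON before you try to parse it:

```r
library(jsonlite)

# Hypothetical model replies, made up for illustration
good_reply <- '{"sentiment": "positive", "confidence": 0.8}'
bad_reply  <- "Sure! The sentiment is positive."

# validate() returns TRUE for valid JSON and FALSE otherwise
is_good <- isTRUE(validate(good_reply))
is_bad  <- isTRUE(validate(bad_reply))

c(is_good, is_bad)
```

Note that valid JSON requires double-quoted keys and string values, so it helps to say so in the prompt.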

Principle 4: Determinism Controls

Set temperature = 0 for more consistent responses, and limit the token budget to control costs.

Note

Temperature is an LLM parameter that controls randomness (0-2 range). A low value (0) gives consistent, predictable, and rigid outputs. A higher value (e.g., 1) gives more creative and varied responses.

Tokens are the chunks of text (roughly word pieces) that a model reads and writes. Token counts are similar to word counts and measure the size of the input/output text.

Tune and Build the LLM machine

Let’s set up a financial news analyzer with prompt engineering, as in the example below.

news_analyzer <- chat_google_gemini(
  system_prompt = r"{
  You are an expert financial analyst. 
  You will be provided a news article title to analyze, which will be wrapped with triple backticks ```.
  Your task is to assess the market sentiment of a news article.
  Return valid JSON with curly braces without any other formatting:  
  – "score": a real number between [0, 1] (0 = extremely negative, 1 = extremely positive).  
  – "rationale": less than 25 words.  
  Do not add any keys, text, or commentary outside the JSON object.  
  }",
  # api_key = "Your_API_KEY",
  api_args = list(
    generationConfig = list(
      temperature = 0,
      maxOutputTokens = 100 
    )
  )
)

Prep: News data

Prepare the news data frame with newsanchor.

library(newsanchor)
news_frame <- get_everything(
  "financial markets",
  language = "en",
  page = 1,
  # api_key = your_api_key,
)[[2]]
news_frame <- news_frame |> 
  select(author, published_at, title, content) |> 
  head(10)

Run on single Document

As a test run:

test_ans <- news_analyzer$chat(str_glue("Tell me the sentiment of this article: ```{news_frame$content[[1]]}```"))
```json
print(test_ans)
```json

Clean output with Regex

Since the output is always wrapped in Markdown json code-fence markers, we can clean it with regex.

# Remove \n
test_ans <- str_replace_all(test_ans, "\\n", "")
# Remove markdown formatters
test_ans <- str_replace(
      test_ans,
      "^```json(.*)```$",  # capturing group
      "\\1"
    )
print(test_ans)
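
An alternative sketch, not the approach used above: instead of removing the fence, extract everything from the first { to the last }. This also works when the model leaves the fence out. The raw_reply string below is made up for illustration.

```r
library(stringr)

fence <- strrep("`", 3)

# Hypothetical raw reply, wrapped in a Markdown json fence
raw_reply <- str_c(
  fence, "json\n",
  "{\n  \"score\": 0.7,\n  \"rationale\": \"Upbeat earnings outlook.\"\n}\n",
  fence
)

# dotall = TRUE lets . span the line breaks inside the JSON object
json_text <- str_extract(raw_reply, regex("\\{.*\\}", dotall = TRUE))
json_text
```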

Parsing JSON formats

The jsonlite package parses JSON-formatted strings.

library(jsonlite)
fromJSON(test_ans) # Parse as list object
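
As a sketch of what the parsed object looks like, here is fromJSON() applied to a made-up reply in the format requested by the system prompt:

```r
library(jsonlite)

# Hypothetical cleaned reply, made up for illustration
clean_reply <- '{"score": 0.7, "rationale": "Upbeat earnings outlook."}'

# A JSON object becomes a named list
parsed <- fromJSON(clean_reply)
names(parsed)
parsed$score
```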

Deploy model on data

Now, we can analyze the sentiment of financial news titles.

Step 1: Build a function that

  • Reads title and
  • Generates LLM results
  • Cleans it
get_sentiment <- function(title){
  prompt <- str_glue("Tell me the sentiment of this article: ```{title}```")
  llm_response <- news_analyzer$chat(prompt, echo = "none")
  clean_response <- 
    str_replace_all(llm_response, "\\n", "") |> 
    str_replace(
      "^```json(.*)```$",  # capturing group
      "\\1"
    ) |> 
    fromJSON()
  return(clean_response)
}
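
fromJSON() stops with an error when a reply is not valid JSON, for example if the reply is cut off by the token limit, and one bad reply would then abort the whole run. A defensive sketch, where safe_parse() is a hypothetical helper that returns NA fields instead:

```r
library(jsonlite)

# Hypothetical helper: parse a reply, or return NA fields if parsing fails
safe_parse <- function(txt) {
  tryCatch(
    fromJSON(txt),
    error = function(e) list(score = NA_real_, rationale = NA_character_)
  )
}

ok  <- safe_parse('{"score": 0.2, "rationale": "Weak guidance."}')
bad <- safe_parse('{"score": 0.2, "rationale": "Weak gui') # truncated reply

ok$score
bad$score
```

Swapping fromJSON() for such a helper inside get_sentiment() would keep the same output columns for every row.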

Step 2: Map the function

news_frame <- news_frame |> 
  mutate(
    response = map(title, get_sentiment)
  )
news_frame |> head(3)

Step 3: Tidy the data (unnest_wider())

news_frame <- news_frame |> 
  unnest_wider(response)
news_frame |> head(3)  
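
Steps 2 and 3 can be tried without an API key. In this sketch, fake_sentiment() is a stub standing in for get_sentiment() and the titles are made up:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)

# Stub standing in for get_sentiment(); returns a fixed list and calls no API
fake_sentiment <- function(title) {
  list(score = 0.5, rationale = paste("Stub result for:", title))
}

# Hypothetical titles
toy <- tibble(title = c("Stocks rally", "Bonds slip"))

toy_scored <- toy |>
  mutate(response = map(title, fake_sentiment)) |> # list column, one list per row
  unnest_wider(response)                           # one column per list element

toy_scored
```

Each named element of the list (score, rationale) becomes its own column.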

Exercise

Collect and analyze 10 financial news articles for market sentiment.

  • Choose a financial topic (e.g., “stock market”, “cryptocurrency”).
  • Collect 10 articles on that topic using newsanchor.
  • Replicate class examples to generate sentiment scores and rationales.